100+ datasets found
  1. CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jan 28, 2022
    Cite
    Bernd Gruner; Thomas Heinze; Clemens-Alexander Brust (2022). CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type Inference Systems [Dataset]. http://doi.org/10.5281/zenodo.5747024
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bernd Gruner; Thomas Heinze; Clemens-Alexander Brust
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Python repositories mined from GitHub on January 20, 2021. It enables cross-domain evaluation of type inference systems. For this purpose, it consists of two sub-datasets, each containing only projects from the web domain or the scientific computing domain, respectively. To build these, we searched for projects that depend on either Flask or NumPy. Furthermore, only projects that also depend on mypy were considered, because this should ensure that at least parts of each project have type annotations; these can later be used as ground truth. Further details about the dataset will be described in an upcoming paper; as soon as it is published, it will be linked here.
    The dataset consists of two files, one per sub-dataset. The web domain file lists 3129 repositories and the scientific computing file lists 4783 repositories. Each file has two columns: the URL of the GitHub repository and the commit hash used. The dataset can therefore be downloaded with shell or Python scripts; for example, the pipeline provided by ManyTypes4Py can be used.
    If repositories no longer exist or have become private, you can contact us at the following email address: bernd.gruner@dlr.de. We have a backup of all repositories and will be happy to help you.
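
    A minimal sketch of downloading the listed repositories with Python, assuming a two-column, comma-separated file (repository URL, commit hash) as described above; the file name and helper name are hypothetical:

    import csv
    import subprocess
    from pathlib import Path

    def clone_at_commit(list_file: str, target_dir: str = "repos") -> None:
        """Clone each repository listed in list_file and check out the recorded commit."""
        Path(target_dir).mkdir(exist_ok=True)
        with open(list_file, newline="") as f:
            for url, commit in csv.reader(f):
                name = url.rstrip("/").rsplit("/", 1)[-1].removesuffix(".git")
                dest = Path(target_dir) / name
                if not dest.exists():
                    subprocess.run(["git", "clone", url, str(dest)], check=True)
                # Pin the working tree to the commit recorded in the dataset.
                subprocess.run(["git", "-C", str(dest), "checkout", commit], check=True)

    # Example (file name is hypothetical):
    # clone_at_commit("web_domain.csv")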

  2. Data from: Domain-adaptive Data Synthesis for Large-scale Supermarket...

    • data.niaid.nih.gov
    Updated Apr 5, 2024
    Cite
    Kampel, Martin (2024). Domain-adaptive Data Synthesis for Large-scale Supermarket Product Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7750241
    Explore at:
    Dataset updated
    Apr 5, 2024
    Dataset provided by
    Strohmayer, Julian
    Kampel, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition

    This repository contains the data synthesis pipeline and synthetic product recognition datasets proposed in [1].

    Data Synthesis Pipeline:

    We provide the Blender 3.1 project files and Python source code of our data synthesis pipeline (pipeline.zip), accompanied by the FastCUT models used for synthetic-to-real domain translation (models.zip). For the synthesis of new shelf images, a product assortment list and product images must be provided in the corresponding directories products/assortment/ and products/img/. The pipeline expects product images to follow the naming convention c.png, with c corresponding to a GTIN or generic class label (e.g., 9120050882171.png). The assortment list, assortment.csv, is expected to use the sample format [c, w, d, h], with c being the class label and w, d, and h being the packaging dimensions of the given product in mm (e.g., [4004218143128, 140, 70, 160]). The assortment list to use and the number of images to generate can be specified in generateImages.py (see comments). The rendering process is initiated by executing load.py either from within Blender or from a command-line terminal as a background process.
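
    A minimal sketch, under the naming and format conventions above, of checking that every product in an assortment list has a matching image; the directory layout follows the description, while the bracket handling and helper name are assumptions:

    import csv
    from pathlib import Path

    def check_assortment(assortment_csv: str = "products/assortment/assortment.csv",
                         img_dir: str = "products/img") -> list[str]:
        """Return class labels from the assortment list that have no matching c.png image."""
        missing = []
        with open(assortment_csv, newline="") as f:
            for row in csv.reader(f):
                if not row:
                    continue
                # Expected row format: c, w, d, h (class label, packaging dimensions in mm).
                c = row[0].strip().lstrip("[")
                if not (Path(img_dir) / f"{c}.png").exists():
                    missing.append(c)
        return missing

    # print(check_assortment())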

    Datasets:

    SG3k - Synthetic GroZi-3.2k (SG3k) dataset, consisting of 10,000 synthetic shelf images with 851,801 instances of 3,234 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.

    SG3kt - Domain-translated version of SG3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.

    SGI3k - Synthetic GroZi-3.2k (SGI3k) dataset, consisting of 10,000 synthetic shelf images with 838,696 instances of 1,063 GroZi-3.2k products. Instance-level bounding boxes and generic class labels are provided for all product instances.

    SGI3kt - Domain-translated version of SGI3k, utilizing GroZi-3.2k as the target domain. Instance-level bounding boxes and generic class labels are provided for all product instances.

    SPS8k - Synthetic Product Shelves 8k (SPS8k) dataset, comprised of 16,224 synthetic shelf images with 1,981,967 instances of 8,112 supermarket products. Instance-level bounding boxes and GTIN class labels are provided for all product instances.

    SPS8kt - Domain-translated version of SPS8k, utilizing SKU110k as the target domain. Instance-level bounding boxes and GTIN class labels for all product instances.

    Table 1: Dataset characteristics.

    Dataset  Images  Products  Instances  Labels                           Translation
    SG3k     10,000  3,234     851,801    bounding box & generic class¹    none
    SG3kt    10,000  3,234     851,801    bounding box & generic class¹    GroZi-3.2k
    SGI3k    10,000  1,063     838,696    bounding box & generic class²    none
    SGI3kt   10,000  1,063     838,696    bounding box & generic class²    GroZi-3.2k
    SPS8k    16,224  8,112     1,981,967  bounding box & GTIN              none
    SPS8kt   16,224  8,112     1,981,967  bounding box & GTIN              SKU110k

    Sample Format

    A sample consists of an RGB image (i.png) and an accompanying label file (i.txt), which contains the labels for all product instances present in the image. Labels use the YOLO format [c, x, y, w, h].

    ¹SG3k and SG3kt use generic pseudo-GTIN class labels, created by combining the GroZi-3.2k food product category number i (1-27) with the product image index j (j.jpg), following the convention i0000j (e.g., 13000097).

    ²SGI3k and SGI3kt use the generic GroZi-3.2k class labels from https://arxiv.org/abs/2003.06800.
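
    A minimal sketch of reading one sample's label file in the YOLO format described above; whether the coordinates are normalized to the image size is an assumption (as is usual for YOLO labels):

    def read_labels(txt_path: str) -> list[tuple[str, float, float, float, float]]:
        """Parse an i.txt label file with one 'c x y w h' entry per line."""
        instances = []
        with open(txt_path) as f:
            for line in f:
                if not line.strip():
                    continue
                c, x, y, w, h = line.split()
                instances.append((c, float(x), float(y), float(w), float(h)))
        return instances

    # boxes = read_labels("0.txt")  # pairs with the image 0.png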

    Download and Use

    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, Julian, and Martin Kampel. "Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition." International Conference on Computer Analysis of Images and Patterns. Cham: Springer Nature Switzerland, 2023.

    BibTeX citation:

    @inproceedings{strohmayer2023domain,
      title={Domain-Adaptive Data Synthesis for Large-Scale Supermarket Product Recognition},
      author={Strohmayer, Julian and Kampel, Martin},
      booktitle={International Conference on Computer Analysis of Images and Patterns},
      pages={239--250},
      year={2023},
      organization={Springer}
    }

  3. visual_domain_decathlon

    • tensorflow.org
    Updated Aug 28, 2017
    Cite
    (2017). visual_domain_decathlon [Dataset]. https://www.tensorflow.org/datasets/catalog/visual_domain_decathlon
    Explore at:
    Dataset updated
    Aug 28, 2017
    Description

    This contains the 10 datasets used in the Visual Domain Decathlon, part of the PASCAL in Detail Workshop Challenge (CVPR 2017). The goal of this challenge is to solve simultaneously ten image classification problems representative of very different visual domains.

    Some of the datasets included here are also available as separate datasets in TFDS. However, notice that images were preprocessed for the Visual Domain Decathlon (resized isotropically to have a shorter side of 72 pixels) and might have different train/validation/test splits. Here we use the official splits for the competition.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('visual_domain_decathlon', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/visual_domain_decathlon-aircraft-1.2.0.png

  4. The dataset for the study of code change patterns in Python

    • data.niaid.nih.gov
    Updated Oct 19, 2021
    Cite
    Anonymous (2021). The dataset for the study of code change patterns in Python [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4004117
    Explore at:
    Dataset updated
    Oct 19, 2021
    Dataset authored and provided by
    Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset of Python projects used for the study of code change patterns and their automation. The dataset lists 120 projects, divided into four domains — Web, Media, Data, and ML+DL.

  5. Cybersec-Mutli-domain

    • huggingface.co
    Cite
    Zain Nadeem, Cybersec-Mutli-domain [Dataset]. https://huggingface.co/datasets/ZainNadeem7/Cybersec-Mutli-domain
    Explore at:
    Authors
    Zain Nadeem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Creator: Zain Nadeem
    Role: Python Django Developer | Software Engineer | Prompt Engineer | Ethical Hacker
    License: CC BY 4.0
    Records: ~220,000
    Format: JSONL
    Language: English

      📌 Overview
    

    The CyberSec Multi-Domain Dataset is a structured collection of synthetic and open-source cybersecurity data across five important domains. It is designed for building, testing, and benchmarking machine learning models in cybersecurity, threat intelligence, and automation systems. This dataset helps… See the full description on the dataset page: https://huggingface.co/datasets/ZainNadeem7/Cybersec-Mutli-domain.
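
    A minimal sketch of iterating over the records, assuming only the JSONL format stated above; the file name is hypothetical and the field names depend on the domain:

    import json

    def read_jsonl(path: str):
        """Yield one record (a dict) per line from a JSON Lines file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    # for record in read_jsonl("cybersec_multi_domain.jsonl"):
    #     print(record.keys())
    #     break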

  6. Airlines Flights Data

    • kaggle.com
    Updated Jul 29, 2025
    Cite
    Data Science Lovers (2025). Airlines Flights Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/airlines-flights-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data Science Lovers
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    📹Project Video available on YouTube - https://youtu.be/gu3Ot78j_Gc

    Airlines Flights Dataset for Different Cities

    The flight booking dataset for various airlines was scraped date-wise from a well-known travel website in a structured format. It contains records of flights between cities in India, with features such as source and destination city, arrival and departure time, duration, and price.

    The data is available as a CSV file, which we analyze using pandas DataFrames.

    This analysis will be helpful for those working in the airline and travel domains.

    Using this dataset, we answered the following questions with Python in our project (a pandas sketch follows the feature list below):

    Q.1. What are the airlines in the dataset, accompanied by their frequencies?

    Q.2. Show Bar Graphs representing the Departure Time & Arrival Time.

    Q.3. Show Bar Graphs representing the Source City & Destination City.

    Q.4. Does the price vary with the airline?

    Q.5. Does ticket price change based on the departure time and arrival time?

    Q.6. How does the price change with the source and destination city?

    Q.7. How is the price affected when tickets are bought only 1 or 2 days before departure?

    Q.8. How does the ticket price vary between Economy and Business class?

    Q.9. What is the average price of a Vistara flight from Delhi to Hyderabad in Business class?

    These are the main Features/Columns available in the dataset :

    1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.

    2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.

    3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.

    4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.

    5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.

    6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.

    7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.

    8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.

    9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.

    10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.

    11) Price: The target variable; it stores the ticket price.
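
    A minimal pandas sketch answering a few of the questions above; the CSV file name and the exact column spellings are assumptions based on the feature list:

    import pandas as pd

    # Column names (airline, class, price, ...) are assumed from the feature list above.
    df = pd.read_csv("airlines_flights_data.csv")

    # Q.1: airlines and their frequencies
    print(df["airline"].value_counts())

    # Q.4: does the price vary with the airline?
    print(df.groupby("airline")["price"].mean().sort_values())

    # Q.8: ticket price by class (Economy vs. Business)
    print(df.groupby("class")["price"].mean())

    # Q.9: average Business-class price for Vistara from Delhi to Hyderabad
    mask = (
        (df["airline"] == "Vistara")
        & (df["source_city"] == "Delhi")
        & (df["destination_city"] == "Hyderabad")
        & (df["class"] == "Business")
    )
    print(df.loc[mask, "price"].mean())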

  7. Accompanying Dataset and Python Code for Reproducibility and Implementation...

    • entrepot.recherche.data.gouv.fr
    bin, hdf +2
    Updated Jan 23, 2025
    + more versions
    Cite
    Rémi Roncen (2025). Accompanying Dataset and Python Code for Reproducibility and Implementation of the IR-TDIBC Method [Dataset]. http://doi.org/10.57745/TPWI2R
    Explore at:
    Available download formats: bin(1165), hdf(2970976), txt(2698577), text/x-python(11551), txt(2696692), text/x-python(6179), txt(2699255), bin(17228), text/x-python(7182), text/x-python(3087), txt(2698327), bin(3077), bin(1002), text/x-python(7583)
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    Rémi Roncen
    License

    Etalab Open License 2.0 (etalab-2.0): https://spdx.org/licenses/etalab-2.0.html

    Dataset funded by
    ERC
    Description

    Supporting Dataset and Python Code for TDIBC Method

    This repository provides the dataset and Python codes necessary to regenerate the figures presented in the manuscript "Revisiting Nonlinear Impedance in Acoustic Liners" (https://hal.science/hal-04810729v1) and to facilitate the proper implementation of the Impulse-Response Time-Domain Impedance Boundary Condition (IR-TDIBC) method. The materials aim to promote transparency, reproducibility, and accessibility for researchers working with nonlinear impedance models and acoustic liners in the time domain.

    Contents

    • Dataset: Includes data used in the manuscript, covering experimental measurements obtained in an impedance tube and flow noise obtained in the B2A bench at ONERA.
    • Python Scripts: Scripts designed to recreate the figures from the paper and demonstrate the IR-TDIBC implementation step by step.

    Features

    • Scripts for generating plots and verifying results from the manuscript.
    • Clear examples to help users adapt the IR-TDIBC method to their specific setups.
    • Annotations and explanations within the code for ease of understanding and modification.

  8. web_graph

    • tensorflow.org
    Updated Nov 23, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2022). web_graph [Dataset]. http://identifiers.org/arxiv:2112.02194
    Explore at:
    Dataset updated
    Nov 23, 2022
    Description

    This dataset contains a sparse graph representing web link structure for a small subset of the Web.

    It is a processed version of a single crawl performed by CommonCrawl in 2021, where we strip everything and keep only the link->outlinks structure. The final dataset is essentially in int -> List[int] format, with each integer id representing a URL.

    Also, in order to increase the value of this resource, we created 6 different versions of WebGraph, each varying in sparsity pattern and locale. We took the following processing steps, in order:

    • We started with WAT files from the June 2021 crawl.
    • Since the outlinks in HTTP-Response-Metadata are stored as relative paths, we convert them to absolute paths using urllib after validating each link.
    • To study locale-specific graphs, we further filter based on 2 top-level domains, ‘de’ and ‘in’, each producing a graph with an order of magnitude fewer nodes.
    • These graphs can still have arbitrary sparsity patterns and dangling links. Thus we further filter the nodes in each graph to have a minimum of K ∈ [10, 50] inlinks and outlinks. Note that we only do this processing once, so this is still an approximation, i.e., the resulting graph might have nodes with fewer than K links.
    • Using both locale and count filters, we finalize 6 versions of the WebGraph dataset, summarized in the following table.
    Version    Top-level domain  Min count  Num nodes  Num edges
    sparse     (all)             10         365.4M     30B
    dense      (all)             50         136.5M     22B
    de-sparse  de                10         19.7M      1.19B
    de-dense   de                50         5.7M       0.82B
    in-sparse  in                10         1.5M       0.14B
    in-dense   in                50         0.5M       0.12B

    All versions of the dataset have following features:

    • "row_tag": a unique identifier of the row (source link).
    • "col_tag": a list of unique identifiers of non-zero columns (dest outlinks).
    • "gt_tag": a list of unique identifiers of non-zero columns used as ground truth (dest outlinks), empty for train/train_t splits.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('web_graph', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  9. Dataset for interactive course on BioImage Analysis with Python (BIAPy)

    • explore.openaire.eu
    Updated May 5, 2020
    Cite
    Guillaume Witz (2020). Dataset for interactive course on BioImage Analysis with Python (BIAPy) [Dataset]. http://doi.org/10.5281/zenodo.3786306
    Explore at:
    Dataset updated
    May 5, 2020
    Authors
    Guillaume Witz
    Description

    This dataset can be used to run the course on image processing with Python available at https://github.com/guiwitz/neubias_academy_biapy. It combines microscopy images from different publicly available sources. All files are either in the Public Domain (PD) or released under a CC-BY license. The list of the original locations of the data as well as their licenses can be found in the LICENSE file.

  10. Python IFC Escape Route Model Generator - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Oct 27, 2024
    + more versions
    Cite
    (2024). Python IFC Escape Route Model Generator - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/60ea743b-3117-5ede-9401-8eb55028de31
    Explore at:
    Dataset updated
    Oct 27, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context

    This research data contains the Python project to generate escape route domain models in the IFC format, created by researchers from the TU Wien Research Unit Digital Building Process. It is linked to the research paper: "Fischer, S., Urban, H., Schranz, C., Haselberger, M., & Schnabel, F. (2024). Generation of new BIM domain models from escape route analysis results. Developments in the Built Environment, 19, 100499. https://doi.org/10.1016/j.dibe.2024.100499"

    The research paper describes different ways of storing the results of escape route analysis in IFC models. Five different variants have been evaluated. This Python project contains the code to generate the most promising variant, "Routes group Segments -- Group". The generated IFC models for all variants for a custom test model and a real-world model are also published, as well as the two initial models:

    • Custom test model for escape route analysis in IFC format: https://doi.org/10.48436/hx8gz-zw339
    • Real-world test model for escape route analysis in IFC format: https://doi.org/10.48436/fnmrh-crh59
    • Custom escape route models in IFC format: https://doi.org/10.48436/dpwd5-33k50
    • Real-world escape route models in IFC format: https://doi.org/10.48436/rrd14-t1108

    Technical details

    The project uses the programming language Python and was successfully executed with Python 3.10, 3.11, and 3.12. The most important library is IfcOpenShell (tested for versions 0.7.0 to 0.7.11). Instructions for downloading and installing IfcOpenShell can be found here: https://docs.ifcopenshell.org/ifcopenshell-python/installation.html. It is important to install the correct version compatible with the installed Python version. The input data is provided by JSON files containing the escape route data of the two initial IFC models. Instructions on how to use the code are included in the README.md file in the zip folder. All data files are licensed under CC BY 4.0; all software files are licensed under the MIT License. The IFC Escape Route Model Generator is also available for the JavaScript programming language: https://doi.org/10.48436/c35ty-ky950
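
    A minimal sketch of inspecting a generated IFC model with IfcOpenShell, the library named above; the file name is hypothetical, and querying IfcGroup entities is an assumption based on the "Routes group Segments -- Group" variant name:

    import ifcopenshell

    # Open a generated escape route model (file name is hypothetical).
    model = ifcopenshell.open("escape_route_model.ifc")
    print(model.schema)  # IFC schema version of the file

    # Inspect the groups that may represent routes/segments in this variant.
    groups = model.by_type("IfcGroup")
    print(f"{len(groups)} IfcGroup entities found")
    for group in groups[:5]:
        print(group.is_a(), getattr(group, "Name", None))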

  11. clustered_tulu_3_16

    • huggingface.co
    Updated Jul 28, 2025
    Cite
    Malikeh Ehghaghi (2025). clustered_tulu_3_16 [Dataset]. https://huggingface.co/datasets/Malikeh1375/clustered_tulu_3_16
    Explore at:
    Dataset updated
    Jul 28, 2025
    Authors
    Malikeh Ehghaghi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Clustered_Tulu_3_16 Multi-Domain Dataset

    This dataset contains high-quality examples across 16 specialized domains, automatically extracted and curated from the Tulu-3 SFT mixture using advanced clustering techniques.

      🎯 Multi-Domain Structure
    

    This repository provides 16 domain-specific configurations, each optimized for different types of tasks:

    Configuration Domain Train Test Total

    python_string_and_list_processing Python String & List Processing 43,564 10… See the full description on the dataset page: https://huggingface.co/datasets/Malikeh1375/clustered_tulu_3_16.
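
    A minimal sketch of loading one of the 16 domain-specific configurations with the Hugging Face datasets library; the configuration name is taken from the table above, while the split name and field layout are assumptions:

    from datasets import load_dataset

    ds = load_dataset(
        "Malikeh1375/clustered_tulu_3_16",
        "python_string_and_list_processing",
        split="train",  # split name is an assumption
    )
    print(ds)
    print(ds[0])  # field names follow the Tulu-3 SFT mixture format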

  12. Data associated with manuscript "Spatial Frequency domain Mueller matrix...

    • catalog.data.gov
    • +2more
    Updated Jan 7, 2023
    Cite
    National Institute of Standards and Technology (2023). Data associated with manuscript "Spatial Frequency domain Mueller matrix imaging" [Dataset]. https://catalog.data.gov/dataset/data-associated-with-manuscript-spatial-frequency-domain-mueller-matrix-imaging-43bb2
    Explore at:
    Dataset updated
    Jan 7, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This archive contains the spatial frequency domain Mueller matrix data associated with the paper J. Chue-Sang, M. Litorja, A. M. Goldfain, and T. A. Germer, "Spatial Frequency domain Mueller matrix imaging," J. Biomedical Optics 27(12), 126003 (2022). The paper shows a subset of the data included in this archive. A Python script, analyze.py, is provided to assist the user in reading the data. The script can be run without any arguments from the top folder and will generate all the figures included in this archive. The script requires Python 3.6 and Matplotlib 3.4. A MATLAB script, analyze.m, is also provided.

  13. staqc

    • huggingface.co
    Updated Mar 31, 2023
    Cite
    Charles Koutcheme (2023). staqc [Dataset]. https://huggingface.co/datasets/koutch/staqc
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 31, 2023
    Authors
    Charles Koutcheme
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    StaQC (Stack Overflow Question-Code pairs) is a dataset of around 148K Python and 120K SQL domain question-code pairs, which are automatically mined from Stack Overflow using a Bi-View Hierarchical Neural Network, as described in the paper "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow" (WWW'18).

  14. rlu_control_suite

    • tensorflow.org
    Updated Nov 23, 2022
    + more versions
    Cite
    (2022). rlu_control_suite [Dataset]. https://www.tensorflow.org/datasets/catalog/rlu_control_suite
    Explore at:
    Dataset updated
    Nov 23, 2022
    Description

    RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API, which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established.

    The datasets follow the RLDS format to represent steps and episodes.

    The DeepMind Control Suite (Tassa et al., 2018) is a set of control tasks implemented in MuJoCo (Todorov et al., 2012). We consider a subset of the tasks provided in the suite that cover a wide range of difficulties.

    Most of the datasets in this domain are generated using D4PG. For the environments Manipulator insert ball and Manipulator insert peg we use V-MPO Song et al., 2020 to generate the data as D4PG is unable to solve these tasks. We release datasets for 9 control suite tasks. For details on how the dataset was generated, please refer to the paper.

    DeepMind Control Suite is a traditional continuous action RL benchmark. In particular, we recommend you test your approach in DeepMind Control Suite if you are interested in comparing against other state of the art offline RL methods.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('rlu_control_suite', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  15. Dataset related to process monitoring and condition monitoring of a...

    • b2find.eudat.eu
    Updated Oct 10, 2024
    + more versions
    Cite
    (2024). Dataset relaterat till processövervakning och tillståndsövervakning av en lagerringsslipmaskin - Dataset for the Implementation of Condition-based Maintenance and Maintenance Decision-making of a Bearing Ring Grinder - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/aa8255c2-9170-51a8-9dd7-00e1514510dc
    Explore at:
    Dataset updated
    Oct 10, 2024
    Description

    In the article (Ahmer, M., Sandin, F., Marklund, P. et al., 2022), we have investigated the effective use of sensors in a bearing ring grinder for failure classification in the condition-based maintenance context. The proposed methodology combines domain knowledge of process monitoring and condition monitoring to successfully achieve failure mode prediction with high accuracy using only a few key sensors. This enables manufacturing equipment to take advantage of advanced data processing and machine learning techniques. The grinding machine is of type SGB55 from Lidköping Machine Tools and is used to produce the functional raceway surface of inner rings of type SKF-6210 deep groove ball bearings. Additional sensors for vibration, acoustic emission, force, and temperature are installed to monitor the machine condition while producing bearing components under different operating conditions. Data is sampled from the sensors as well as the machine's numerical controller during operation. Selected parts are measured for the produced quality.

    Ahmer, M., Sandin, F., Marklund, P., Gustafsson, M., & Berglund, K. (2022). Failure mode classification for condition-based maintenance in a bearing ring grinding machine. In The International Journal of Advanced Manufacturing Technology (Vol. 122, pp. 1479–1495). https://doi.org/10.1007/s00170-022-09930-6

    The files are of three categories and are grouped in zipped folders. The PDF file named "readme_data_description.pdf" describes the content of the files in the folders. The "lib" folder includes information on libraries to read the .tdms data files in Matlab or Python (see the sketch below). The raw time-domain sensor signal data are grouped in seven main folders named after each test run, e.g. "test_1" ... "test_7". Each test includes seven dressing cycles named e.g. "dresscyc_1" ... "dresscyc_7". Each dressing cycle includes .tdms files for fifteen rings, one per individual grinding cycle. The column descriptions for both "Analogue" and "Digital" channels are given in "readme_data_description.pdf".

    The machine and process parameters used for the tests, as sampled from the machine's control system (numerical controller), are compiled for all test runs in a single file, "process_data.csv", in the folder "proc_param". The column description is available in "readme_data_description.pdf" under "Process Parameters". The measured quality data (nine quality parameters, normalized) of the selected produced parts are recorded in the file "measured_quality_param.csv" under the folder "quality". The description of the quality parameters is available in "readme_data_description.pdf". The quality parameter disposition based on their actual acceptance tolerances for the process step is presented in the file "quality_disposition.csv" under the folder "quality".

    Raw time series data collected from the machine and sensors during production of bearing rings, together with bearing ring quality measurement data.
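
    A minimal sketch of reading one of the .tdms files in Python; the "lib" folder documents the intended libraries, and npTDMS is used here as one common choice (an assumption), with a hypothetical file path:

    from nptdms import TdmsFile  # pip install npTDMS

    # Read one grinding-cycle file (the path is hypothetical).
    tdms = TdmsFile.read("test_1/dresscyc_1/ring_01.tdms")

    # List channel groups (e.g. "Analogue"/"Digital") and their channels.
    for group in tdms.groups():
        for channel in group.channels():
            data = channel[:]  # channel samples as a NumPy array
            print(group.name, channel.name, len(data))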

  16. Using Python Packages and HydroShare to Advance Open Data Science and...

    • beta.hydroshare.org
    • hydroshare.org
    zip
    Updated Sep 28, 2023
    Cite
    Jeffery S. Horsburgh; Amber Spackman Jones; Anthony M. Castronova; Scott Black (2023). Using Python Packages and HydroShare to Advance Open Data Science and Analytics for Water [Dataset]. https://beta.hydroshare.org/resource/4f4acbab5a8c4c55aa06c52a62a1d1fb/
    Explore at:
    Available download formats: zip (31.0 MB)
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    HydroShare
    Authors
    Jeffery S. Horsburgh; Amber Spackman Jones; Anthony M. Castronova; Scott Black
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scientific and management challenges in the water domain require synthesis of diverse data. Many data analysis tasks are difficult because datasets are large and complex; standard data formats are not always agreed upon or mapped to efficient structures for analysis; scientists may lack training for tackling large and complex datasets; and it can be difficult to share, collaborate around, and reproduce scientific work. Overcoming barriers to accessing, organizing, and preparing datasets for analyses can transform the way water scientists work. Building on the HydroShare repository’s cyberinfrastructure, we have advanced two Python packages that make data loading, organization, and curation for analysis easier, reducing time spent in choosing appropriate data structures and writing code to ingest data. These packages enable automated retrieval of data from HydroShare and the USGS’s National Water Information System (NWIS) (i.e., a Python equivalent of USGS’ R dataRetrieval package), loading data into performant structures that integrate with existing visualization, analysis, and data science capabilities available in Python, and writing analysis results back to HydroShare for sharing and publication. While these Python packages can be installed for use within any Python environment, we will demonstrate how the technical burden for scientists associated with creating a computational environment for executing analyses can be reduced and how sharing and reproducibility of analyses can be enhanced through the use of these packages within CUAHSI’s HydroShare-linked JupyterHub server.
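
    As a concrete illustration of the kind of automated retrieval described above, the sketch below pulls daily values from NWIS with the USGS dataretrieval Python package; the package choice, site number, and date range are assumptions and not part of this resource:

    import dataretrieval.nwis as nwis  # pip install dataretrieval

    # Daily values (service="dv") for an example USGS gauge and period.
    df = nwis.get_record(
        sites="03339000",
        service="dv",
        start="2023-01-01",
        end="2023-01-31",
    )
    print(df.head())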

    This HydroShare resource includes all of the materials presented in a workshop at the 2023 CUAHSI Biennial Colloquium.

  17. OpenMIIR - a public domain dataset of EEG recordings for music imagery...

    • figshare.com
    zip
    Updated Jun 1, 2023
    + more versions
    Cite
    Sebastian Stober; Avital Sternin; Adrian M. Owen; Jessica A. Grahn (2023). OpenMIIR - a public domain dataset of EEG recordings for music imagery information retrieval [Dataset]. http://doi.org/10.6084/m9.figshare.1541151.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Sebastian Stober; Avital Sternin; Adrian M. Owen; Jessica A. Grahn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Music imagery information retrieval (MIIR) systems may one day be able to recognize a song just as we think of it. As a step towards such technology, we are presenting a public domain dataset of electroencephalography (EEG) recordings taken during music perception and imagination. We acquired this data during an ongoing study that so far comprised 10 subjects listening to and imagining 12 short music fragments - each 7s-16s long - taken from well-known pieces. These stimuli were selected from different genres and systematically span several musical dimensions such as meter, tempo and the presence of lyrics. This way, various retrieval and classification scenarios can be addressed. The dataset is primarily aimed to enable music information retrieval researchers interested in these new MIIR challenges to easily test and adapt their existing approaches for music analysis like fingerprinting, beat tracking or tempo estimation on this new kind of data. We also hope that the OpenMIIR dataset will facilitate a stronger interdisciplinary collaboration between music information retrieval researchers and neuroscientists.

  18. excersice 4 python

    • kaggle.com
    Updated Jun 14, 2018
    Cite
    Anthi.Mastrogiannaki (2018). excersice 4 python [Dataset]. https://www.kaggle.com/anthi1984/excersice-4-python/tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 14, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anthi.Mastrogiannaki
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Anthi.Mastrogiannaki

    Released under CC0: Public Domain


  19. ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...

    • researchdata.tuwien.ac.at
    • b2find.eudat.eu
    zip
    Updated Jun 6, 2025
    + more versions
    Cite
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    TU Wien
    Authors
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

    This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

    Dataset paper (public preprint)

    A description of this dataset, including the methodology and validation results, is available at:

    Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

    Abstract

    ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
    However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
    Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is to rely only on the original observational record, without the need for ancillary variables or model-based information. Due to the intrinsic challenge, no global, long-term, univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.

    Summary

    • Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling
    • Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology
    • Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.
    • More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023), https://doi.org/10.5281/zenodo.8320869

    Programmatic Download

    You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

    #!/bin/bash

    # Set download directory
    DOWNLOAD_DIR=~/Downloads

    base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

    # Loop through years 1991 to 2023 and download & extract data
    for year in {1991..2023}; do
      echo "Downloading $year.zip..."
      wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
      unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
      rm "$DOWNLOAD_DIR/$year.zip"
    done

    Data details

    The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

    ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).
    • sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.
    • sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`).
    • sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`); in that case, they provide a smoothed version of the original data.
    • gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.
    • frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

    Additional information for each variable is given in the netCDF attributes.
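
    A minimal sketch of opening one daily file and reading the variables listed above with xarray; the library choice and the date in the file name are assumptions:

    import xarray as xr

    # One daily 0.25° global image (the date in the file name is an example).
    ds = xr.open_dataset(
        "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-19910805000000-fv09.1r1.nc"
    )

    sm = ds["sm"]                  # volumetric soil moisture (m3/m3)
    unc = ds["sm_uncertainty"]     # random-error estimate of sm
    observed = ds["gapmask"] == 1  # True where an actual satellite observation exists

    # Mean soil moisture over observed grid cells only.
    print(float(sm.where(observed).mean()))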

    Version Changelog

    Changes in v9.1r1 (previous version was v09.1):

    • This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

    Software to open netCDF files

    These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:

    References

    • Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
    • Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869
    • Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020
    • Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

    Related Records

    The following records are all part of the Soil Moisture Climate Data Records from satellites community

    • ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77

  20. MatSeg: Material State Segmentation Dataset and Benchmark

    • zenodo.org
    zip
    Updated May 22, 2025
    Cite
    Zenodo (2025). MatSeg: Material State Segmentation Dataset and Benchmark [Dataset]. http://doi.org/10.5281/zenodo.11331618
    Explore at:
    Available download formats: zip
    Dataset updated
    May 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    MatSeg Dataset and benchmark for zero-shot material state segmentation.

    The MatSeg benchmark, containing 1220 real-world images and their annotations, is available in MatSeg_Benchmark.zip; the file contains documentation and Python readers.

    The MatSeg dataset, containing synthetic images infused with natural image patterns, is available in MatSeg3D_part_*.zip and MatSeg2D_part_*.zip (* stands for a number).

    MatSeg3D_part_*.zip: contains synthetic 3D scenes

    MatSeg2D_part_*.zip: contains synthetic 2D scenes

    Readers and documentation for the synthetic data are available at: Dataset_Documentation_And_Readers.zip

    Readers and documentation for the real-images benchmark are available at: MatSeg_Benchmark.zip

    The Code used to generate the MatSeg Dataset is available at: https://zenodo.org/records/11401072

    Additional permanent sources for downloading the dataset and metadata: 1, 2

    Evaluation scripts for the Benchmark are now available at:

    https://zenodo.org/records/13402003 and https://e.pcloud.link/publink/show?code=XZsP8PZbT7AJzG98tV1gnVoEsxKRbBl8awX

    Description

    Materials and their states form a vast array of patterns and textures that define the physical and visual world. Minerals in rocks, sediment in soil, dust on surfaces, infection on leaves, stains on fruits, and foam in liquids are some of these almost infinite numbers of states and patterns.

    Image segmentation of materials and their states is fundamental to the understanding of the world and is essential for a wide range of tasks, from cooking and cleaning to construction, agriculture, and chemistry laboratory work.

    The MatSeg dataset focuses on zero-shot segmentation of materials and their states, meaning identifying the region of an image belonging to a specific material type of state, without previous knowledge or training of the material type, states, or environment.

    The dataset contains a large set of (100k) synthetic images and benchmarks of 1220 real-world images for testing.

    Benchmark

    The benchmark contains 1220 real-world images with a wide range of material states and settings, for example food states (cooked/burned), plants (infected/dry), rocks/soil (minerals/sediment), construction materials/metals (rusted, worn), and liquids (foam/sediment), among many other states, without being limited to a fixed set of classes or environments. The goal is to evaluate the segmentation of materials without knowledge of, or pretraining on, the material or setting. The focus is on materials with complex scattered boundaries and gradual transitions (like the level of wetness of a surface).

    Evaluation scripts for the Benchmark are now available at: 1 and 2.

    Synthetic Dataset

    The synthetic dataset is composed of synthetic scenes rendered in 2D and 3D using Blender. The synthetic data is infused with patterns, materials, and textures automatically extracted from real images, allowing it to capture the complexity and diversity of the real world while maintaining the precision and scale of synthetic data. 100k images and their annotations are available to download.

    License

    This dataset, including all its components, is released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. To the extent possible under law, the authors have dedicated all copyright and related and neighboring rights to this dataset to the public domain worldwide. This dedication applies to the dataset and all derivative works.

    The MatSeg 2D and 3D synthetic scenes were generated using the Open Images dataset, which is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0). For these components, you must comply with the terms of the Apache License. In addition, the MatSeg3D dataset uses ShapeNet 3D assets under a GNU license.

    Example Usage:

    An example of training and evaluation code for a network trained on the dataset and evaluated on the benchmark is given at these URLs: 1, 2

    This includes an evaluation script for the MatSeg benchmark, a training script using the MatSeg dataset, and the weights of a trained model.

    Paper:

    More detail on the work can be found in the paper "Infusing Synthetic Data with Real-World Patterns for Zero-Shot Material State Segmentation".

    Croissant metadata and additional sources for downloading the dataset are available at 1,2
