100+ datasets found
  1. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE
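    For a quick look at the labels, a minimal pandas sketch along these lines should work (the label column name is an assumption; check the CSV header first):

    import pandas as pd

    # Load the project list shipped in this repository.
    df = pd.read_csv("NICHE.csv")

    # Column names are assumptions -- inspect them before relying on "label".
    print(df.columns.tolist())
    print(df["label"].value_counts())  # expect 441 engineered vs. 131 non-engineered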

  2. Open Machine Learning Projects

    • kaggle.com
    zip
    Updated Mar 14, 2020
    Cite
    Prashant Banerjee (2020). Open Machine Learning Projects [Dataset]. https://www.kaggle.com/prashant111/open-machine-learning-projects
    Explore at:
    Available download formats: zip (4520 bytes)
    Dataset updated
    Mar 14, 2020
    Authors
    Prashant Banerjee
    Description

    DESCRIPTION

    Information about popular open source projects related to machine learning.

    SUMMARY

    The goal of this dataset is to better understand how open source machine learning projects evolve. Data collection date: early May 2018. Source: GitHub user interface and API. Contains original research.

    Presentation

    Columns

    • name - name of the project.
    • alignment - either corporate, academia or indie. Corporate projects are developed by professional engineers, typically have a dedicated development team, and try to solve specific problems. Academic projects usually mention publications; they support research. Independent projects are often a hobby.
    • company - name of the company if the alignment is corporate.
    • forecast - expected middle-term evolution of the project. 1 means positive, 0 means negative (stagnation) and -1 means factual death.
    • year - when the project was created. Defaults to the GitHub repository creation date but can be earlier; this is subject to manual adjustments.
    • code of conduct - whether the project has a code of conduct.
    • contributing - whether the project has a contributions guide.
    • stars - number of stargazers on GitHub.
    • issues - number of issues on GitHub, either open or closed.
    • contributors - number of contributors as reported by GitHub.
    • core - estimation of the core team, aka "bus factor".
    • team - number of people who commit to the project regularly.
    • commits - number of commits in the project.
    • team / all - ratio of the number of commits by the dedicated development team to the overall number of contributions. Roughly indicates which part of the project is owned by the internal developers.
    • link - URL of the project.
    • language - API language. multi means several languages.
    • implementation - the language mainly used for implementing the project.
    • license - license of the project.
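    As a hedged example of working with these columns (the CSV filename inside the Kaggle archive is an assumption), one could cross-tabulate alignment against the forecast:

    import pandas as pd

    # Filename is an assumption -- check the archive contents after download.
    df = pd.read_csv("open_ml_projects.csv")

    # Cross-tabulate project alignment (corporate/academia/indie) against the
    # middle-term forecast (1 positive, 0 stagnation, -1 factual death).
    print(pd.crosstab(df["alignment"], df["forecast"]))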

  3. Data from: A Dataset for GitHub Repository Deduplication

    • data-staging.niaid.nih.gov
    Updated Feb 9, 2020
    Cite
    Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris (2020). A Dataset for GitHub Repository Deduplication [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3653919
    Explore at:
    Dataset updated
    Feb 9, 2020
    Dataset provided by
    University of Tennessee
    Athens University of Economics and Business
    Authors
    Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of which projects are considered related.

    The dataset is provided as two files identifying GitHub repositories using the login-name/project-name convention. The file deduplicate_names contains 10,649,348 tab-separated records mapping a duplicated source project to a definitive target project.

    The file forks_clones_noise_names is a 50,324,363 member superset of the source projects, containing also projects that were excluded from the mapping as noise.
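    A minimal sketch for applying the mapping, assuming deduplicate_names holds two tab-separated columns (source, target) with no header:

    import csv

    # Build the source -> ultimate-parent mapping from the tab-separated file.
    dedup = {}
    with open("deduplicate_names", newline="") as f:
        for source, target in csv.reader(f, delimiter="\t"):
            dedup[source] = target

    # Resolve a repository to its definitive project (identity if not a copy).
    def resolve(repo: str) -> str:
        return dedup.get(repo, repo)

    print(resolve("someuser/somefork"))  # hypothetical repository name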

  4. Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset and use it as our validation dataset.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a tesseract script to run text extraction on detected text rows. It is included in our code archive as text_recognition_multipro.py.

    We used a Java tool provided by Falk Böschen and adapted it to our file structure. We include it as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

  5. Data from: A Benchmark Suite for Systematically Evaluating Reasoning...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 13, 2024
    Cite
    Bortolotti Samuele; Marconato Emanuele; Carraro Tommaso; Morettin Paolo; van Krieken Emile; Vergari Antonio; Teso Stefano; Passerini Andrea (2024). A Benchmark Suite for Systematically Evaluating Reasoning Shortcuts [Dataset]. http://doi.org/10.5281/zenodo.11612556
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bortolotti Samuele; Marconato Emanuele; Carraro Tommaso; Morettin Paolo; van Krieken Emile; Vergari Antonio; Teso Stefano; Passerini Andrea
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Codebase [Github] | Dataset [Zenodo]

    Abstract

    The advent of powerful neural classifiers has increased interest in problems that require both learning and reasoning. These problems are critical for understanding important properties of models, such as trustworthiness, generalization, interpretability, and compliance with safety and structural constraints. However, recent research observed that tasks requiring both learning and reasoning on background knowledge often suffer from reasoning shortcuts (RSs): predictors can solve the downstream reasoning task without associating the correct concepts to the high-dimensional data. To address this issue, we introduce rsbench, a comprehensive benchmark suite designed to systematically evaluate the impact of RSs on models by providing easy access to highly customizable tasks affected by RSs. Furthermore, rsbench implements common metrics for evaluating concept quality and introduces novel formal verification procedures for assessing the presence of RSs in learning tasks. Using rsbench, we highlight that obtaining high-quality concepts in both purely neural and neuro-symbolic models is a far-from-solved problem. rsbench is available on GitHub.

    Usage

    We recommend visiting the official code website for instructions on how to use the dataset and accompanying software code.

    License

    All ready-made data sets and generated datasets are distributed under the CC-BY-SA 4.0 license, with the exception of Kand-Logic, which is derived from Kandinsky-patterns and as such is distributed under the GPL-3.0 license.

    Datasets Overview

    • CLIP-embeddings. This folder contains the saved activations from a pretrained CLIP model applied to the tested dataset. It includes embeddings that represent the dataset in a format suitable for further analysis and experimentation.
    • BDD_OIA-original-dataset. This directory holds the original files from the X-OIA project by Xu et al. [1]. These datasets have been made publicly available for ease of access and further research. If you are going to use them, please consider citing the original authors.
    • kand-logic-3k. This folder contains all images generated for the Kand-Logic project. Each image is accompanied by annotations for both concepts and labels.
    • bbox-kand-logic-3k. In this directory, you will find images from the Kand-Logic project that have undergone a preprocessing step. These images are extracted based on bounding boxes, rescaled, and include annotations for concepts and labels.
    • sdd-oia. This folder includes all images and labels generated using rsbench.
    • sdd-oia-embeddings. This directory contains 512-dimensional embeddings extracted from a ResNet18 model pretrained on ImageNet. The embeddings are derived from the sdd-oia dataset.
    • BDD-OIA-preprocessed. Here you will find preprocessed data that follow the methodology outlined by Sawada and Nakamura [2]. The folder contains 2048-dimensional embeddings extracted from a Faster-RCNN model pretrained on the BDD-100k dataset.

    The original BDD datasets can be downloaded from the following Google Drive link: [Download BDD Dataset].

    References

    [1] Xu et al., *Explainable Object-Induced Action Decision for Autonomous Vehicles*, CVPR 2020.

    [2] Sawada and Nakamura, *Concept Bottleneck Model With Additional Unsupervised Concepts*, IEEE 2022.

  6. Materials and their Mechanical Properties

    • kaggle.com
    zip
    Updated Apr 15, 2023
    Cite
    Purushottam Nawale (2023). Materials and their Mechanical Properties [Dataset]. https://www.kaggle.com/datasets/purushottamnawale/materials
    Explore at:
    Available download formats: zip (145487 bytes)
    Dataset updated
    Apr 15, 2023
    Authors
    Purushottam Nawale
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    We utilized a dataset of Machine Design materials, which includes information on their mechanical properties. The dataset was obtained from the Autodesk Material Library and comprises 15 columns, also referred to as features/attributes. This is a real-world dataset and does not contain any random values. However, due to missing values, we only utilized seven of these columns for our ML model. You can access the related GitHub repository here: https://github.com/purushottamnawale/material-selection-using-machine-learning

    To develop an ML model, we employed several Python libraries, including NumPy, pandas, scikit-learn, and graphviz, in addition to other technologies such as Weka, MS Excel, VS Code, Kaggle, Jupyter Notebook, and GitHub. We used Weka to quickly visualize the data and understand the relationships between the features without requiring any programming expertise.
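    As an illustrative sketch of the kind of classifier described (the filename and the feature/target column names below are hypothetical placeholders, not the actual dataset headers):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("materials.csv")  # filename is an assumption

    # Hypothetical feature/target columns -- replace with the seven columns
    # actually used after inspecting df.columns.
    features = ["Su", "Sy", "E", "G", "mu", "Ro"]
    X, y = df[features], df["Material"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    clf = DecisionTreeClassifier().fit(X_train, y_train)
    print(f"test accuracy: {clf.score(X_test, y_test):.2f}")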

    My problem statement is material selection for an EV chassis. If you have any specific ideas, be sure to implement them and add your code on Kaggle.

    A Detailed Research Paper is available on https://iopscience.iop.org/article/10.1088/1742-6596/2601/1/012014

  7. Data from: FISBe: A real-world benchmark dataset for instance segmentation...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin, json +3
    Updated Apr 2, 2024
    Cite
    Lisa Mais; Peter Hirsch; Claire Managan; Ramya Kandarpa; Josef Lorenz Rumberger; Annika Reinke; Lena Maier-Hein; Gudrun Ihrke; Dagmar Kainmueller (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. http://doi.org/10.5281/zenodo.10875063
    Explore at:
    Available download formats: zip, text/x-python, bin, json, txt
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lisa Mais; Peter Hirsch; Claire Managan; Ramya Kandarpa; Josef Lorenz Rumberger; Annika Reinke; Lena Maier-Hein; Gudrun Ihrke; Dagmar Kainmueller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 26, 2024
    Description

    General

    For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

    Summary

    • A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
      • 30 completely labeled (segmented) images
      • 71 partly labeled images
      • altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes 30-60 min on average, while a difficult one can take up to 4 hours)
    • To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
    • A set of metrics and a novel ranking score for respective meaningful method benchmarking
    • An evaluation of three baseline methods in terms of the above metrics and score

    Abstract

    Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

    Dataset documentation:

    We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

    >> FISBe Datasheet

    Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

    Files

    • fisbe_v1.0_{completely,partly}.zip
      • contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
    • fisbe_v1.0_mips.zip
      • maximum intensity projections of all samples, for convenience.
    • sample_list_per_split.txt
      • a simple list of all samples and the subset they are in, for convenience.
    • view_data.py
      • a simple python script to visualize samples, see below for more information on how to use it.
    • dim_neurons_val_and_test_sets.json
      • a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
    • Readme.md
      • general information

    How to work with the image files

    Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
    For each image, we provide a pixel-wise instance segmentation for all separable neurons.
    Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification).
    The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
    The segmentation mask for each neuron is stored in a separate channel.
    The order of dimensions is CZYX.

    We recommend working in a virtual environment, e.g., by using conda:

    conda create -y -n flylight-env -c conda-forge python=3.9
    conda activate flylight-env

    How to open zarr files

    1. Install the python zarr package:
      pip install zarr
    2. Open a zarr file with:

      import zarr
      raw = zarr.open("<sample>.zarr", mode='r', path="volumes/raw")
      seg = zarr.open("<sample>.zarr", mode='r', path="volumes/gt_instances")

      # optional: load the arrays into memory as numpy arrays
      import numpy as np
      raw_np = np.array(raw)

    Zarr arrays are read lazily on-demand.
    Many functions that expect numpy arrays also work with zarr arrays.
    Optionally, the arrays can also explicitly be converted to numpy arrays.

    How to view zarr image files

    We recommend using napari to view the image data.

    1. Install napari:
      pip install "napari[all]"
    2. Save the following Python script:

      import zarr, sys, napari

      # Load raw image channels and ground-truth instance masks from the zarr file.
      raw = zarr.load(sys.argv[1], path="volumes/raw")
      gts = zarr.load(sys.argv[1], path="volumes/gt_instances")

      viewer = napari.Viewer(ndisplay=3)
      for idx, gt in enumerate(gts):
          viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
      viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
      viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
      viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
      napari.run()

    3. Execute:
      python view_data.py <path/to/sample.zarr>

    Metrics

    • S: Average of avF1 and C
    • avF1: Average F1 Score
    • C: Average ground truth coverage
    • clDice_TP: Average true positives clDice
    • FS: Number of false splits
    • FM: Number of false merges
    • tp: Relative number of true positives

    For more information on our selected metrics and formal definitions please see our paper.

    Baseline

    To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN), and a non-learnt application-specific color clustering from Duan et al.
    For detailed information on the methods and the quantitative results please see our paper.

    License

    The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Citation

    If you use FISBe in your research, please use the following BibTeX entry:

    @misc{mais2024fisbe,
     title =    {FISBe: A real-world benchmark dataset for instance
             segmentation of long-range thin filamentous structures},
     author =    {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
             Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
             Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
     year =     2024,
     eprint =    {2404.00130},
     archivePrefix ={arXiv},
     primaryClass = {cs.CV}
    }

    Acknowledgments

    We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable
    discussions.
    P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
    This work was co-funded by Helmholtz Imaging.

    Changelog

    There have been no changes to the dataset so far.
    All future changes will be listed on the changelog page.

    Contributing

    If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.

    All contributions are welcome!

  8. Synthea synthetic patient data for lung cancer risk prediction machine...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Chen, AJ (2023). Synthea synthetic patient data for lung cancer risk prediction machine learning [Dataset]. http://doi.org/10.7910/DVN/GD5XWE
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Chen, AJ
    Description

    This dataset contains Synthea synthetic patient data used in building ML models for lung cancer risk prediction. The ML models are used to simulate ML-enabled LHS. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data. For data source and methods, see the first ML-LHS simulation paper published in Nature Scientific Reports: https://www.nature.com/articles/s41598-022-23011-4.

  9. OGBG-Code (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBG-Code (Processed for PyG) [Dataset]. https://www.kaggle.com/datasets/dataup1/ogbg-code/code
    Explore at:
    Available download formats: zip (1314604183 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    Description

    OGBG-Code

    Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code

    Usage in Python

    from torch_geometric.data import DataLoader
    from ogb.graphproppred import PygGraphPropPredDataset
    
    dataset = PygGraphPropPredDataset(name = 'ogbg-code', root = '/kaggle/input') 
    
    batch_size = 32
    split_idx = dataset.get_idx_split()
    train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
    valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
    test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
    

    Description

    Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories among the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes AST edges, AST nodes, and tokenized method names. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.

    Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures code semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.

    Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and avoids a model that trivially memorizes the idiosyncrasies of training projects (such as the naming conventions and the coding style of a specific developer) to achieve a high test score.

    Summary

    Package | #Graphs | #Nodes per Graph | #Edges per Graph | Split Type | Task Type | Metric
    ogb>=1.2.0 | 452,741 | 125.2 | 124.2 | Project | Sub-token prediction | F1 score

    License: MIT License

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
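    For instance, a minimal sketch of computing the metric with the OGB Evaluator (assuming an ogb release that still ships the original ogbg-code name; later releases renamed it ogbg-code2):

    from ogb.graphproppred import Evaluator

    evaluator = Evaluator(name='ogbg-code')
    # Sub-token sequences: ground truth vs. model predictions per method.
    input_dict = {
      'seq_ref': [['get', 'user', 'name'], ['parse', 'args']],
      'seq_pred': [['get', 'name'], ['parse', 'args']],
    }
    print(evaluator.eval(input_dict))  # dict with precision/recall/F1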

    References

    [1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
    [2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
    [3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
    [4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
    [5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.

  10. Data from: A large-scale comparative analysis of Coding Standard conformance...

    • figshare.com
    application/x-gzip
    Updated Oct 4, 2021
    Cite
    Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa (2021). A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects [Dataset]. http://doi.org/10.6084/m9.figshare.12377237.v3
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Oct 4, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity.

    • results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.
    • notebooks_out.tar.gz: Tables and figures generated by notebooks.
    • source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes Jupyter notebooks used to produce figures in the paper.

    The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositories
    Published in ESEM 2020: https://doi.org/10.1145/3382494.3410680
    Preprint: https://arxiv.org/abs/2007.08978

  11. A Representative User-centric GitHub Developers Dataset for Malicious...

    • figshare.com
    png
    Updated Dec 29, 2022
    + more versions
    Cite
    Yushan Liu (2022). A Representative User-centric GitHub Developers Dataset for Malicious Account Detection [Dataset]. http://doi.org/10.6084/m9.figshare.21789566.v1
    Explore at:
    Available download formats: png
    Dataset updated
    Dec 29, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yushan Liu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Using GitHub APIs, we construct an unbiased dataset of over 10 million GitHub users. The data was collected between Jul. 20 and Aug. 27, 2018, covering 10,000 users. Each data entry is stored in JSON format, represents one GitHub user, and contains the descriptive information from the user's profile page, information about their commit activities, and their created/forked public repositories.

    We provide a sample of the dataset in 'Github_dataset_sample.json'. If you are interested in using the full dataset, please contact chenyang AT fudan.edu.cn to obtain the full dataset for research purposes only.
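    To peek at the sample (whether the file holds a single JSON array or one JSON object per line is an assumption to verify):

    import json

    # Try a single JSON array first, then fall back to JSON Lines.
    with open('Github_dataset_sample.json') as f:
        try:
            users = json.load(f)
        except json.JSONDecodeError:
            f.seek(0)
            users = [json.loads(line) for line in f if line.strip()]

    print(len(users), 'users')
    print(users[0].keys())  # profile, commit-activity and repository fields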

    Please cite the following paper when using the dataset: Qingyuan Gong, Yushan Liu, Jiayun Zhang, Yang Chen, Qi Li, Yu Xiao, Xin Wang, Pan Hui. Detecting Malicious Accounts in Online Developer Communities Using Deep Learning. To appear: IEEE Transactions on Knowledge and Data Engineering.

  12. PVS - Passive Vehicular Sensors Datasets

    • kaggle.com
    zip
    Updated Jan 27, 2021
    Cite
    Jeferson Menegazzo (2021). PVS - Passive Vehicular Sensors Datasets [Dataset]. https://www.kaggle.com/datasets/jefmenegazzo/pvs-passive-vehicular-sensors-datasets/discussion
    Explore at:
    Available download formats: zip (44498315084 bytes)
    Dataset updated
    Jan 27, 2021
    Authors
    Jeferson Menegazzo
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    We strongly recommend that you read this content on the project page on GitHub, by clicking here

    Intelligent Vehicle Perception Based on Inertial Sensing and Artificial Intelligence

    This project aims to develop solutions for vehicular perception through inertial sensor signals and Artificial Intelligence models. Vehicular perception comprises exteroception and proprioception. Exteroception aims to understand the environment outside the vehicle, recognizing the road features on which it travels. These features include transient events in the form of anomalies and obstacles, such as potholes, cracks, speed bumps, etc.; and persistent events, such as surface type, conservation condition, and the road surface quality. On the other hand, proprioception aims to understand vehicular movements to identify their own behavior. These identifications can also be transient in the form of driving events, such as lane change, braking, skidding, aquaplaning, turning right or left; and persistent, as a safe or dangerous driving behavior profile. This situational information (perceptions) has wide applicability in Intelligent Transport Systems (ITS) such as Advanced Driver Assistance Systems (ADAS) and autonomous vehicles.

    For the development of this project, we collected nine datasets using GPS, camera, inertial sensors (accelerometers and gyroscopes), magnetometer, and a temperature sensor. These data were produced with contextual variations including three different vehicles, driven by three different drivers, traveling through three different environments. To recognize and classify the vehicular perception patterns, we developed several models based on Artificial Intelligence, spanning classical Machine Learning and Deep Learning approaches. Below we describe the datasets produced, the models developed, and the results obtained, together with the published scientific papers and source code.


    Vehicular Perception Research

    The project is active and we are currently developing new models for new perception pattern recognition. Below, the research progress is described in chronological order of research development. On ResearchGate you can also find the published scientific papers and request full texts for free.

    Research in English

    Vehicular Perception Based on Inertial Sensing: a Systematic Review

    In this paper, we describe the state-of-the-art vehicle perception produced through inertial sensors and Artificial Intelligence techniques. Through a literature review, we compiled the data extracted from the selected studies and described each paper in detail and chronological order of publication. Access here

    Vehicular Perception Based on Inertial Sensing: A Structured Mapping of Approaches and Methods

    In this paper, we present a structured literature mapping of the state-of-the-art vehicle perception produced through inertial sensors and Artificial Intelligence techniques. We describe a structured, approach, and technologies-oriented panorama of this field. Access here

    [Road Surface Type Classification Based on Inertial Sensors and Machine Learning: A Comparison Between Cla...

  13. Github-metadata-for-project-effort-estimates

    • kaggle.com
    zip
    Updated Nov 9, 2025
    Cite
    Carlos Moreno (2025). Github-metadata-for-project-effort-estimates [Dataset]. https://www.kaggle.com/datasets/carlosxmoreno/github-metadata-for-project-effort-estimates
    Explore at:
    Available download formats: zip (6637856 bytes)
    Dataset updated
    Nov 9, 2025
    Authors
    Carlos Moreno
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    This dataset contains metadata for over 37,000 GitHub repositories, created for research on early-stage software project effort estimation.
    It includes 37 attributes describing repository size, activity, collaboration, licensing, and language usage.

    The dataset has been used with unsupervised machine learning models to analyze project segmentation according to different levels of effort.
    No labeled data related to project or product complexity is included.

    Data Collection

    The data was gathered using both the GitHub GraphQL API and the GitHub REST API.
    Repositories were included if they met all of the following criteria:

    1. They are software repositories.
    2. They are public.
    3. They were created in or after 2018.

    Most attributes were obtained via the GraphQL API.
    The contributors attribute is not accessible via GraphQL, so it was collected using the REST API for each repository and then merged with the rest of the metadata.

    Some attributes are synthetic features added during data analysis, such as:

    • language_count
    • reponame

    Attributes

    Below is a description of the 37 attributes included in the JSON schema:

    • name – Repository name.
    • description – Repository description.
    • stargazerCount – Number of stars the repository has received.
    • forkCount – Number of forks of the repository.
    • createdAt – Date and time the repository was created.
    • updatedAt – Date and time the repository was last updated.
    • pushedAt – Date and time of the last push to the repository.
    • diskUsage – Size of the repository in kilobytes.
    • isArchived – Indicates if the repository is archived (boolean).
    • isEmpty – Indicates if the repository is empty (boolean).
    • isFork – Indicates if the repository is a fork of another repository (boolean).
    • isInOrganization – Indicates if the repository belongs to an organization (boolean).
    • isPrivate – Indicates if the repository is private (boolean).
    • isTemplate – Indicates if the repository is a template (boolean).
    • hasIssuesEnabled – Indicates if issues are enabled in the repository (boolean).
    • hasWikiEnabled – Indicates if the wiki is enabled in the repository (boolean).
    • hasProjectsEnabled – Indicates if projects are enabled in the repository (boolean).
    • hasSponsorshipsEnabled – Indicates if sponsorships are enabled in the repository (boolean).
    • mergeCommitAllowed – Indicates if merge commits are allowed in pull requests (boolean).
    • viewerCanSubscribe – Indicates if the viewer can subscribe to notifications (boolean).
    • contributors – Number of contributors to the repository.
    • owner.login – Username of the repository owner.
    • owner.url – URL of the repository owner's profile.
    • licenseInfo.name – License type applied to the repository.
    • primaryLanguage.name – Primary programming language used in the repository.
    • languages.nodes – List of programming languages used in the repository.
    • issues – Total number of issues in the repository.
    • forks – Total number of forks of the repository.
    • assignableUsers – Total number of users who can be assigned issues or pull requests.
    • deployments – Total number of deployments for the repository.
    • environments – Total number of deployment environments.
    • milestones – Total number of milestones in the repository.
    • releases – Total number of releases in the repository.
    • pullRequests – Total number of pull requests in the repository.
    • watchers – Total number of watchers of the repository.
    • reponame – Full repository name in the format owner/repository.
    • language_count – Number of programming languages used in the repository.

    Files and structure

    • df_clean.json – Primary dataset containing metadata for over 37,000 public GitHub repositories used for model training and analysis.
    • df_new_records.json – Out-of-time validation sample (over 5,000 repositories) with the same schema, used to evaluate model generalization performance.

    Both datasets share identical variable definitions and schema.
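    A minimal loading sketch, assuming the JSON layout parses directly with pandas (fall back to json.load plus json_normalize if it does not):

    import pandas as pd

    # Orientation is an assumption; adjust if read_json rejects the layout.
    df = pd.read_json('df_clean.json')

    print(df.shape)                         # expect ~37,000 rows x 37 attributes
    print(df['language_count'].describe())  # synthetic feature described above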

    Intended Use

    This dataset is intended for research and experimentation in:

    • Software project effort estimation
    • Repository analytics and software engineering metrics
    • Unsupervised learning for project segmentation and clustering
    • Studies on open-source project characteristics since 2018

    Limitations

    • Only public repositories created from 2018 onward are included.
    • No direct labels for effort or complexity are provided.
    • The dataset is a snapshot of GitHub at the time of collection; repository metadata may have changed since then.

    Privacy, Ethics, and Data P...

  14. Most Popular GitHub Projects

    • kaggle.com
    zip
    Updated Jan 4, 2023
    Cite
    The Devastator (2023). Most Popular GitHub Projects [Dataset]. https://www.kaggle.com/thedevastator/domain-analysis-of-5000-most-popular-github-repo
    Explore at:
    Available download formats: zip (287982 bytes)
    Dataset updated
    Jan 4, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Most Popular GitHub Projects

    Popularity Factors and Growth Patterns


    About this dataset

    This dataset contains the characteristics of 5000 of the most popular GitHub repositories, based on their total number of stars. It provides a comprehensive overview of each repository's essential features like name, language, description, URL, and growth pattern. Additionally, it offers insight into how these properties factor into the popularity and success of each repository. This can be especially helpful in understanding how certain languages or patterns are more successful in particular use cases or scenarios compared to others. By better understanding these factors and patterns, developers can create projects that best suit their needs while having a higher chance of success on GitHub.


    How to use the dataset

    This dataset provides a comprehensive analysis of the domains of the most popular GitHub repositories, as measured by their total number of stars. It includes many valuable pieces of information that can be used to gain insight into current trends on the platform.

    In order to use this dataset to its fullest potential, it's important to understand each piece of data provided and how it can be used.

    Research Ideas

    • Comparing the popularity of various programming languages on GitHub.
    • Examining the most common topics and domains represented in top repositories, to better understand how developers use GitHub for their projects.
    • Identifying if certain growth patterns can be associated with higher popularity levels on GitHub, as measured by stars and forks

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: Domains of 5,000 GitHub Repositories - Public - Domains.csv

    | Column name | Description |
    |:---------------|:------------|
    | Name | The name of the repository. (String) |
    | Stars | The total number of stars, which serve as a metric to measure popularity. (Integer) |
    | Forks | The total number of forks, which indicate how much collaboration there is on a project. (Integer) |
    | Language | The programming language used in the repository. (String) |
    | Description | A brief overview describing what the repository does and its features. (String) |
    | URL | The URL associated with that specific repository. (String) |
    | Domain | The domain or area within which this particular project works. For example, artificial intelligence or machine learning. (String) |
    | Growth Pattern | Insight into whether the popularity has been increasing steadily, has plateaued out, etc. (String) |
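    For example, to compare domains by median star count (the CSV filename is taken from the table header above):

    import pandas as pd

    df = pd.read_csv('Domains of 5,000 GitHub Repositories - Public - Domains.csv')

    # Median stars per domain, most popular domains first.
    print(df.groupby('Domain')['Stars'].median().sort_values(ascending=False))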


  15. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jan 16, 2024
    Cite
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

    ## Root directory

    - `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements

    - `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)

    - `script`: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.

    ## Dataset

    - `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed

    - `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library

    - `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model

    - `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project

    - `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

    ## RQ1

    - `RQ1/RQ1_dataset-list.txt`: list of HF datasets

    - `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets

    - `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It prints its output to stdout; redirect it to a file to be analyzed by the `RQ1/RQ1_countDataset.py` script

    - `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis

    - `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`

    - `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`

    ## RQ2

    - `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task

    - `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling

    - `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias

    - `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories

    - `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    ## RQ3

    - `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses

    - `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness

    - `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name

    - `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license

    - `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)

    - `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level

    ## scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README.

  16. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jul 25, 2023
    + more versions
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
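
    Putting these pieces together, a hedged sketch of replicating one evaluation sample in Python: it assumes each row of an index file holds the integer row positions of one sample, that the index files have no header row, and uses a placeholder name for the extracted data file:

    import pandas as pd

    data = pd.read_csv("extracted_dataset.csv")            # name is a placeholder
    indices = pd.read_csv("app_val_indices.csv", header=None)

    # Draw the first validation sample and inspect its label distribution.
    sample = data.iloc[indices.iloc[0].dropna().astype(int)]
    print(sample["class_label"].value_counts(normalize=True))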

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  17. Relationship and Entity Extraction Evaluation Dataset (Documents)

    • data.wu.ac.at
    • data.europa.eu
    json
    Updated Jan 20, 2018
    + more versions
    Cite
    Defence Science and Technology Laboratory (2018). Relationship and Entity Extraction Evaluation Dataset (Documents) [Dataset]. https://data.wu.ac.at/schema/data_gov_uk/MWM1MmZkY2UtZWE5Ni00MDIwLThlN2MtNTkxMmNjZWM4NWU5
    Explore at:
    jsonAvailable download formats
    Dataset updated
    Jan 20, 2018
    Dataset provided by
    Defence Science and Technology Laboratory
    Description

    This document dataset was the output of a project that aimed to create a 'gold standard' dataset for training and validating machine learning approaches to natural language processing (NLP). The project was carried out by Aleph Insights and Committed Software on behalf of the Defence Science and Technology Laboratory (Dstl). The dataset focuses specifically on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst; it was therefore constructed using documents and structured schemas relevant to the defence and security analysis domain. A number of data subsets were produced (this is the BBC Online data subset). Further information about this data subset and the others produced (together with licence conditions, attribution and schemas) may be found at the main project GitHub repository (https://github.com/dstl/re3d). Note that the 'documents.json' file is to be used together with the 'entities.json' and 'relations.json' files (also found on this data.gov.uk webpage); their structures and relationships are described on the GitHub webpage.
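
    A minimal sketch of loading the three related files together; whether each file is a JSON array or JSON-lines, and the "documentId" linkage field, are illustrative assumptions - the authoritative schemas are on the project's GitHub page:

    import json

    def load_records(path):
        # Accept either a single JSON array or one JSON object per line.
        with open(path, encoding="utf-8") as f:
            text = f.read().strip()
        if text.startswith("["):
            return json.loads(text)
        return [json.loads(line) for line in text.splitlines() if line.strip()]

    documents = load_records("documents.json")
    entities = load_records("entities.json")
    relations = load_records("relations.json")

    # Group entities by their source document (field name assumed).
    entities_by_doc = {}
    for ent in entities:
        entities_by_doc.setdefault(ent.get("documentId"), []).append(ent)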

  18. git-diff_to_commit_msg

    • huggingface.co
    • kaggle.com
    Updated Oct 5, 2025
    + more versions
    Cite
    Epasinghe (2025). git-diff_to_commit_msg [Dataset]. https://huggingface.co/datasets/seniruk/git-diff_to_commit_msg
    Explore at:
    Dataset updated
    Oct 5, 2025
    Authors
    Epasinghe
    License

    https://choosealicense.com/licenses/other/

    Description

    Hi, I’m Seniru Epasinghe 👋

    I'm an AI undergraduate and an AI enthusiast, working on machine learning projects and open-source contributions. I enjoy exploring AI pipelines, natural language processing, and building tools that make development easier.

    There are 2 versions of this dataset:
    

    git-diff_to_commit_msg - 1.5K rows (available on Hugging Face and Kaggle)

    git-diff_to_commit_msg_large - 1.75M rows (available on Hugging Face and Kaggle)… See the full description on the dataset page: https://huggingface.co/datasets/seniruk/git-diff_to_commit_msg.
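
    A minimal loading sketch with the Hugging Face `datasets` library; the split name "train" is an assumption based on typical dataset cards:

    from datasets import load_dataset

    # Downloads the smaller 1.5K-row variant from the Hub.
    ds = load_dataset("seniruk/git-diff_to_commit_msg", split="train")
    print(ds[0])  # one record: a git diff paired with its commit message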

  19. Chicago Crime with Climate Data, 2021

    • kaggle.com
    zip
    Updated Dec 24, 2021
    Cite
    Mark Rozenberg (2021). Chicago Crime with Climate Data, 2021 [Dataset]. https://www.kaggle.com/datasets/markrozenberg/chicago-crime-with-climate-data-2021
    Explore at:
    zip(5305421 bytes)Available download formats
    Dataset updated
    Dec 24, 2021
    Authors
    Mark Rozenberg
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    In this project I used machine learning and deep learning multiclass classification algorithms to predict the types of crime committed in the city of Chicago in 2021. Moreover, I added weather data as features, in the hope that they would enrich the models and improve predictions.

    project page on GitHub:

    https://github.com/Mark-Rozenberg/Crime-And-Climate
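
    A hedged sketch of the approach described above, not the author's actual pipeline (see the repository for that); every file and column name here is an illustrative assumption:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    crimes = pd.read_csv("chicago_crimes_2021.csv", parse_dates=["date"])
    weather = pd.read_csv("chicago_weather_2021.csv", parse_dates=["date"])

    # Align crime timestamps to days so daily weather features can be joined.
    crimes["date"] = crimes["date"].dt.normalize()
    df = crimes.merge(weather, on="date", how="left")

    X = df[["temp", "precipitation", "wind_speed"]].fillna(0)
    y = df["primary_type"]  # multiclass target: the type of crime

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))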

  20. TextTransfer: Datasets for Impact Detection

    • databank.illinois.edu
    Updated Mar 21, 2024
    Cite
    Maria Becker; Kanyao Han; Antonina Werthmann; Rezvaneh Rezapour; Haejin Lee; Jana Diesner; Andreas Witt (2024). TextTransfer: Datasets for Impact Detection [Dataset]. http://doi.org/10.13012/B2IDB-9934303_V1
    Explore at:
    Dataset updated
    Mar 21, 2024
    Authors
    Maria Becker; Kanyao Han; Antonina Werthmann; Rezvaneh Rezapour; Haejin Lee; Jana Diesner; Andreas Witt
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    German Federal Ministry of Education and Research
    Description

    Impact assessment is an evolving area of research that aims at measuring and predicting the potential effects of projects or programs. Measuring the impact of scientific research is a vibrant subdomain, closely intertwined with impact assessment. A recurring obstacle is the absence of an efficient framework that can facilitate the analysis of lengthy reports and text labeling. To address this issue, we propose a framework for automatically assessing the impact of scientific research projects by identifying pertinent sections in project reports that indicate the potential impacts. We leverage a mixed-method approach, combining manual annotations with supervised machine learning, to extract these passages from project reports. This is a repository for the datasets and code related to this project. Please read and cite the following paper if you would like to use the data: Becker M., Han K., Werthmann A., Rezapour R., Lee H., Diesner J., and Witt A. (2024). Detecting Impact Relevant Sections in Scientific Research. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING).

    This folder contains the following files:

    - evaluation_20220927.ods: annotated German passages (Artificial Intelligence, Linguistics, and Music) - training data
    - annotated_data.big_set.corrected.txt: annotated German passages (Mobility) - training data
    - incl_translation_all.csv: annotated English passages (Artificial Intelligence, Linguistics, and Music) - training data
    - incl_translation_mobility.csv: annotated German passages (Mobility) - training data
    - ttparagraph_addmob.txt: German corpus (unannotated passages)
    - model_result_extraction.csv: impact-relevant passages extracted from the German corpus using the trained model
    - rf_model.joblib: the random forest model we trained to extract impact-relevant passages

    Data processing code can be found at: https://github.com/khan1792/texttransfer
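
    A hedged sketch of reusing the shipped random forest; how passages must be featurized (e.g., TF-IDF or embeddings) is defined by the authors' training code at the GitHub link above, so that step is left as a labeled placeholder:

    import joblib

    model = joblib.load("rf_model.joblib")
    print(type(model).__name__, getattr(model, "n_features_in_", "unknown"))

    # Featurization must mirror the authors' training pipeline (see GitHub).
    # X = featurize(candidate_passages)   # placeholder, pipeline-specific
    # relevant = model.predict(X) == 1    # label encoding assumed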
