Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
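A quick-start sketch for working with the CSV (assuming pandas; the column names and the label encoding used below are assumptions, so inspect the header first):

# Minimal sketch: load NICHE.csv and split projects by label.
# The column name "label" and its 1/0 encoding are assumptions; check df.columns.
import pandas as pd

df = pd.read_csv("NICHE.csv")
print(df.columns.tolist())
engineered = df[df["label"] == 1]      # expected: 441 engineered projects
non_engineered = df[df["label"] == 0]  # expected: 131 non-engineered projects
print(len(engineered), len(non_engineered))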
Information about popular open source projects related to machine learning.
The goal of this dataset is to better understand how open source machine learning projects evolve. Data collection date: early May 2018. Source: GitHub user interface and API. Contains original research.
name - name of the project.
alignment - either corporate, academia, or indie. Corporate projects are developed by professional engineers, typically have a dedicated development team, and try to solve specific problems. Academic projects usually mention publications; they support research. Independent projects are often a hobby.
company - name of the company if the alignment is corporate.
forecast - expected medium-term evolution of the project: 1 means positive, 0 means negative (stagnation), and -1 means factual death.
year - when the project was created. Defaults to the GitHub repository creation date but can be earlier - this is subject to manual adjustments.
code of conduct - whether the project has a code of conduct.
contributing - whether the project has a contribution guide.
stars - number of stargazers on GitHub.
issues - number of issues on GitHub, either open or closed.
contributors - number of contributors as reported by GitHub.
core - estimation of the core team, aka the "bus factor".
team - number of people who commit to the project regularly.
commits - number of commits in the project.
team / all - ratio of the number of commits by the dedicated development team to the overall number of contributions. Roughly indicates which part of the project is owned by the internal developers.
link - URL of the project.
language - API language. multi means several languages.
implementation - the language mainly used for implementing the project.
license - license of the project.
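A minimal usage sketch based on the field list above (the file name ml_projects.csv and the exact column spellings are assumptions):

# Hypothetical loading sketch; file and column names are assumptions.
import pandas as pd

df = pd.read_csv("ml_projects.csv")
# Projects with a positive medium-term forecast and a core team of at least 2
alive = df[(df["forecast"] == 1) & (df["core"] >= 2)]
print(alive[["name", "alignment", "stars", "team / all"]].head())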
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.
The dataset is provided as two files identifying GitHub repositories using the login-name/project-name convention. The file deduplicate_names contains 10,649,348 tab-separated records mapping a duplicated source project to a definitive target project.
The file forks_clones_noise_names is a 50,324,363 member superset of the source projects, containing also projects that were excluded from the mapping as noise.
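A small sketch of how the mapping file might be used (assuming plain tab-separated text with two columns per record, as described above; note that loading all 10.6 million records builds a large in-memory dictionary):

# Sketch: map a repository to its ultimate parent using deduplicate_names.
# Assumes two tab-separated columns: duplicated source project, definitive target.
dedup = {}
with open("deduplicate_names", encoding="utf-8") as f:
    for line in f:
        source, target = line.rstrip("\n").split("\t")
        dedup[source] = target

def ultimate_parent(name):
    """Return the definitive project for a possibly-duplicated one."""
    return dedup.get(name, name)  # projects not in the map are their own parent

print(ultimate_parent("someuser/someproject"))  # hypothetical repository name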
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used that as our validation dataset.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction on detected text rows. This is inside our code archive code.tar as text_recognition_multipro.py.
We used a Java tool provided by Falk Böschen, adapted to our file structure. We included this as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Codebase [Github] | Dataset [Zenodo]
Abstract
The advent of powerful neural classifiers has increased interest in problems that require both learning and reasoning. These problems are critical for understanding important properties of models, such as trustworthiness, generalization, interpretability, and compliance with safety and structural constraints. However, recent research has observed that tasks requiring both learning and reasoning on background knowledge often suffer from reasoning shortcuts (RSs): predictors can solve the downstream reasoning task without associating the correct concepts to the high-dimensional data. To address this issue, we introduce rsbench, a comprehensive benchmark suite designed to systematically evaluate the impact of RSs on models by providing easy access to highly customizable tasks affected by RSs. Furthermore, rsbench implements common metrics for evaluating concept quality and introduces novel formal verification procedures for assessing the presence of RSs in learning tasks. Using rsbench, we highlight that obtaining high-quality concepts in both purely neural and neuro-symbolic models is a far-from-solved problem. rsbench is available on GitHub.
Usage
We recommend visiting the official code website for instructions on how to use the dataset and accompanying software code.
License
All ready-made data sets and generated datasets are distributed under the CC-BY-SA 4.0 license, with the exception of Kand-Logic, which is derived from Kandinsky-patterns and as such is distributed under the GPL-3.0 license.
Datasets Overview
The original BDD datasets can be downloaded from the following Google Drive link: [Download BDD Dataset].
References
[1] Xu et al., *Explainable Object-Induced Action Decision for Autonomous Vehicles*, CVPR 2020.
[2] Sawada and Nakamura, *Concept Bottleneck Model With Additional Unsupervised Concepts*, IEEE 2022.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
We utilized a dataset of Machine Design materials, which includes information on their mechanical properties. The dataset was obtained from the Autodesk Material Library and comprises 15 columns, also referred to as features/attributes. This dataset is a real-world dataset, and it does not contain any random values. However, due to missing values, we only utilized seven of these columns for our ML model. You can access the related GitHub Repository here: https://github.com/purushottamnawale/material-selection-using-machine-learning
To develop an ML model, we employed several Python libraries, including NumPy, pandas, scikit-learn, and graphviz, in addition to other technologies such as Weka, MS Excel, VS Code, Kaggle, Jupyter Notebook, and GitHub. We used the Weka software to quickly visualize the data and understand the relationships between the features, without requiring any programming expertise.
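For illustration, a minimal sketch of such a pipeline (the file name materials.csv, the column handling, and the target column "Material" are assumptions; see the GitHub repository for the actual code):

# Illustrative sketch only; file name, columns, and target are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("materials.csv")
df = df.dropna(axis=1)                 # keep only fully populated columns
X = df.select_dtypes("number")         # numeric mechanical-property features
y = df["Material"]                     # hypothetical target column
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))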
The problem statement is material selection for an EV chassis. If you have any specific ideas, feel free to implement them and share your code on Kaggle.
A detailed research paper is available at https://iopscience.iop.org/article/10.1088/1742-6596/2601/1/012014
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification).
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
pip install zarr
import zarr
raw = zarr.open(<path/to/zarr/file>, mode='r', path="volumes/raw")
seg = zarr.open(<path/to/zarr/file>, mode='r', path="volumes/gt_instances")

# optional:
import numpy as np
raw_np = np.array(raw)
Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also explicitly be converted to numpy arrays.
We recommend using napari to view the image data.
pip install "napari[all]"
import zarr, sys, napari
raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
python view_data.py <path/to/zarr/file>
For more information on our selected metrics and formal definitions please see our paper.
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al..
For detailed information on the methods and the quantitative results please see our paper.
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
title = {FISBe: A real-world benchmark dataset for instance
segmentation of long-range thin filamentous structures},
author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
year = 2024,
eprint = {2404.00130},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.
There have been no changes to the dataset so far.
All future changes will be listed on the changelog page.
If you would like to contribute, have encountered any issues, or have any suggestions, please open an issue for the FISBe dataset in the accompanying GitHub repository.
All contributions are welcome!
This dataset contains Synthea synthetic patient data used in building ML models for lung cancer risk prediction. The ML models are used to simulate ML-enabled LHS. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data. For data source and methods, see the first ML-LHS simulation paper published in Nature Scientific Reports: https://www.nature.com/articles/s41598-022-23011-4.
Webpage: https://ogb.stanford.edu/docs/graphprop/#ogbg-code
from torch_geometric.data import DataLoader
from ogb.graphproppred import PygGraphPropPredDataset
dataset = PygGraphPropPredDataset(name = 'ogbg-code', root = '/kaggle/input')
batch_size = 32
split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx['train']], batch_size = batch_size, shuffle = True)
valid_loader = DataLoader(dataset[split_idx['valid']], batch_size = batch_size, shuffle = False)
test_loader = DataLoader(dataset[split_idx['test']], batch_size = batch_size, shuffle = False)
Graph: The ogbg-code dataset is a collection of Abstract Syntax Trees (ASTs) obtained from approximately 450 thousand Python method definitions. Methods are extracted from a total of 13,587 different repositories across the most popular projects on GitHub. The collection of Python methods originates from GitHub CodeSearchNet, a collection of datasets and benchmarks for machine-learning-based code retrieval. In ogbg-code, the dataset authors contribute an additional feature extraction step, which includes AST edges, AST nodes, and tokenized method names. Altogether, ogbg-code allows you to capture source code with its underlying graph structure, beyond its token sequence representation.
Prediction task: The task is to predict the sub-tokens forming the method name, given the Python method body represented by its AST and node features. This task is often referred to as “code summarization”, because the model is trained to find a succinct and precise description (i.e., the method name chosen by the developer) for a complete logical unit (i.e., the method body). Code summarization is a representative task in the field of machine learning for code, not only for its straightforward adoption in developer tools, but also because it is a proxy measure for assessing how well a model captures code semantics [1]. Following [2,3], the dataset authors use an F1 score to evaluate predicted sub-tokens against ground-truth sub-tokens.
Dataset splitting: The dataset authors adopt a project split [4], where the ASTs for the train set are obtained from GitHub projects that do not appear in the validation and test sets. This split respects the practical scenario of training a model on a large collection of source code (obtained, for instance, from the popular GitHub projects), and then using it to predict method names on a separate code base. The project split stress-tests the model’s ability to capture code’s semantics, and avoids a model that trivially memorizes the idiosyncrasies of training projects (such as the naming conventions and the coding style of a specific developer) to achieve a high test score.
| Package | #Graphs | #Nodes per Graph | #Edges per Graph | Split Type | Task Type | Metric |
|---|---|---|---|---|---|---|
| ogb>=1.2.0 | 452,741 | 125.2 | 124.2 | Project | Sub-token prediction | F1 score |
Website: https://ogb.stanford.edu
The Open Graph Benchmark (OGB) [5] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.
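As a hedged evaluation sketch (the seq_ref/seq_pred input format below is my reading of the ogbg-code evaluator's sub-token convention; print evaluator.expected_input_format to confirm):

# Sketch of unified evaluation with the OGB Evaluator for ogbg-code.
from ogb.graphproppred import Evaluator

evaluator = Evaluator(name='ogbg-code')
print(evaluator.expected_input_format)     # describes the required input dict
input_dict = {
    'seq_ref': [['get', 'user', 'name']],  # ground-truth sub-tokens per method
    'seq_pred': [['get', 'name']],         # predicted sub-tokens per method
}
print(evaluator.eval(input_dict))          # F1, with precision and recall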
[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
[2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[4] Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 143–153, 2019.
[5] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.
I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content of this dataset. For any questions, problems, or issues, please contact the original authors via their website or their GitHub repo.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? We compare a corpus of 1048 open-source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity.
results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.
notebooks_out.tar.gz: Tables and figures generated by notebooks.
source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes Jupyter notebooks used to produce figures in the paper.
The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositories
Published in ESEM 2020: https://doi.org/10.1145/3382494.3410680
Preprint: https://arxiv.org/abs/2007.08978
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Using GitHub APIs, we construct an unbiased dataset of over 10 million GitHub users. The data was collected between Jul. 20 and Aug. 27, 2018, covering 10,000 users. Each data entry is stored in JSON format, representing one GitHub user and containing the descriptive information from the user's profile page, information on their commit activities, and their created/forked public repositories.
We provide a sample of the dataset in 'Github_dataset_sample.json'. If you are interested in using the full dataset, please contact chenyang AT fudan.edu.cn to obtain it, for research purposes only.
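A small inspection sketch for the sample file (whether it is a single JSON array or one JSON object per line is an assumption; adjust the parsing accordingly):

# Sketch: inspect the sample of user records.
import json

with open("Github_dataset_sample.json", encoding="utf-8") as f:
    users = json.load(f)  # if the file is one JSON array
    # users = [json.loads(line) for line in f]  # if it is JSON Lines instead

print(len(users))
print(sorted(users[0].keys()))  # profile, commit-activity, and repository fields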
Please cite the following paper when using the dataset: Qingyuan Gong, Yushan Liu, Jiayun Zhang, Yang Chen, Qi Li, Yu Xiao, Xin Wang, Pan Hui. Detecting Malicious Accounts in Online Developer Communities Using Deep Learning. To appear: IEEE Transactions on Knowledge and Data Engineering.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
We strongly recommend that you read this content on the project page on GitHub, by clicking here
This project aims to develop solutions for vehicular perception through inertial sensor signals and Artificial Intelligence models. Vehicular perception comprises exteroception and proprioception. Exteroception aims to understand the environment outside the vehicle, recognizing the road features on which it travels. These features include transient events in the form of anomalies and obstacles, such as potholes, cracks, speed bumps, etc.; and persistent events, such as surface type, conservation condition, and the road surface quality. On the other hand, proprioception aims to understand vehicular movements to identify their own behavior. These identifications can also be transient in the form of driving events, such as lane change, braking, skidding, aquaplaning, turning right or left; and persistent, as a safe or dangerous driving behavior profile. This situational information (perceptions) has wide applicability in Intelligent Transport Systems (ITS) such as Advanced Driver Assistance Systems (ADAS) and autonomous vehicles.
For the development of this project, we collected nine datasets using GPS, camera, inertial sensors (accelerometers and gyroscopes), magnetometer, and temperature sensor. These data were produced with contextual variations, including three different vehicles, driven by three different drivers, traveling through three different environments. To recognize and classify the vehicular perception patterns, we developed several models based on Artificial Intelligence, spanning classical Machine Learning and Deep Learning approaches. Below we describe the datasets produced, the models developed, and the results obtained, together with published scientific papers and source code.
The project is active, and we are currently developing new models for new perception pattern recognition. The research progress is described below, in chronological order of research development. On ResearchGate you can also find the published scientific papers and request a full text for free.
In this paper, we describe the state of the art in vehicle perception produced through inertial sensors and Artificial Intelligence techniques. Through a literature review, we compiled the data extracted from the selected studies and described each paper in detail, in chronological order of publication.
Access here
In this paper, we present a structured literature mapping of the state of the art in vehicle perception produced through inertial sensors and Artificial Intelligence techniques. We describe a structured, approach- and technology-oriented panorama of this field.
Access here
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains metadata for over 37,000 GitHub repositories, created for research on early-stage software project effort estimation.
It includes 37 attributes describing repository size, activity, collaboration, licensing, and language usage.
The dataset has been used with unsupervised machine learning models to analyze project segmentation according to different levels of effort.
No labeled data related to project or product complexity is included.
The data was gathered using both the GitHub GraphQL API and the GitHub REST API.
Repositories were included if they met all of the following criteria:
Most attributes were obtained via the GraphQL API.
The contributors attribute is not accessible via GraphQL, so it was collected using the REST API for each repository and then merged with the rest of the metadata (a sketch of this step follows below).
Some attributes are synthetic features added during data analysis, such as:
language_count
reponame
Below is a description of the 37 attributes included in the JSON schema:
owner/repository. Both datasets share identical variable definitions and schema.
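A hedged sketch of the REST step mentioned above, counting contributors for one repository (the endpoint and pagination parameters follow the public GitHub REST API; the helper name and example repository are illustrative):

# Sketch: count contributors via the paginated GitHub REST API.
import requests

def contributor_count(owner_repo, token=None):
    """Count contributors for owner/repo, including anonymous ones."""
    headers = {"Authorization": f"token {token}"} if token else {}
    count, page = 0, 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{owner_repo}/contributors",
            params={"per_page": 100, "page": page, "anonymous": "true"},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        count += len(batch)
        if len(batch) < 100:
            return count
        page += 1

print(contributor_count("torvalds/linux"))  # example repository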
This dataset is intended for research and experimentation in:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset contains the characteristics of 5000 of the most popular GitHub repositories, based on their total number of stars. It provides a comprehensive overview of each repository's essential features, such as name, language, description, URL, and growth pattern. Additionally, it offers insight into how these properties factor into the popularity and success of each repository. This can be especially helpful in understanding how certain languages or patterns are more successful in particular use cases or scenarios compared to others. By better understanding these factors and patterns, developers can create projects that best suit their needs while having a higher chance of achieving success on GitHub.
This dataset provides a comprehensive analysis of the domains of the most popular GitHub repositories, as measured by their total number of stars. It includes many valuable pieces of information that can be used to gain insight into current trends on the platform.
In order to use this dataset to its fullest potential, it's important to understand each piece of data provided and how it can be used.
- Comparing the popularity of various programming languages on GitHub.
- Examining the most common topics and domains represented in top repositories, to better understand how developers use GitHub for their projects.
- Identifying whether certain growth patterns can be associated with higher popularity levels on GitHub, as measured by stars and forks.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Domains of 5,000 GitHub Repositories - Public - Domains.csv

| Column name | Description |
|:---|:---|
| Name | The name of the repository. (String) |
| Stars | The total number of stars, which serve as a metric to measure popularity. (Integer) |
| Forks | The total number of forks, which indicate how much collaboration there is on a project. (Integer) |
| Language | The programming language used in the repository. (String) |
| Description | A brief overview describing what the repository does and its features. (String) |
| URL | The URL associated with that specific repository. (String) |
| Domain | The domain or area within which this particular project works, for example, artificial intelligence or machine learning. (String) |
| Growth Pattern | Insight into whether the popularity has been increasing steadily or has plateaued, etc. (String) |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates, among other pieces of info, whether the model has a license, and if so, what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
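For intuition, a minimal sketch of how APP samples can be drawn (the sample size and the APP-OQ smoothness selection are assumptions here; use the provided index files to replicate the paper's exact samples):

# Sketch of the artificial prevalence protocol (APP): draw class prevalences
# uniformly from the unit simplex, then resample data items accordingly.
import numpy as np

def app_sample(y, n_classes, m, rng):
    """Draw one APP sample of size m from the labeled pool y."""
    p = rng.dirichlet(np.ones(n_classes))   # uniform draw from the simplex
    counts = rng.multinomial(m, p)          # number of items per class
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=k, replace=True)
        for c, k in enumerate(counts) if k > 0
    ])
    return idx, p

rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=1000)           # toy ordinal labels 0..4
sample_idx, prevalence = app_sample(y, n_classes=5, m=100, rng=rng)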
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
This document dataset was the output of a project aimed at creating a 'gold standard' dataset that could be used to train and validate machine learning approaches to natural language processing (NLP). The project was carried out by Aleph Insights and Committed Software on behalf of the Defence Science and Technology Laboratory (Dstl). The dataset specifically focuses on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst. The dataset was therefore constructed using documents and structured schemas relevant to the defence and security analysis domain. A number of data subsets were produced (this is the BBC Online data subset). Further information about this data subset (BBC Online) and the others produced (together with licence conditions, attribution, and schemas) may be found at the main project GitHub repository webpage (https://github.com/dstl/re3d). Note that the 'documents.json' file is to be used together with the 'entities.json' and 'relations.json' files (also found on this data.gov.uk webpage); their structures and relationships are described on the given GitHub webpage.
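A hedged sketch of joining the files (the field names _id, documentId, begin, end, type, and text used below are assumptions; consult the schemas on the GitHub page):

# Sketch: resolve entity spans against their source documents.
import json

def read_json_lines(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

documents = {d["_id"]: d for d in read_json_lines("documents.json")}
entities = read_json_lines("entities.json")

for e in entities[:5]:
    doc = documents[e["documentId"]]
    print(e["type"], "->", doc["text"][e["begin"]:e["end"]])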
Other license: https://choosealicense.com/licenses/other/
Hi, I’m Seniru Epasinghe 👋
I’m an AI undergraduate and an AI enthusiast, working on machine learning projects and open-source contributions. I enjoy exploring AI pipelines, natural language processing, and building tools that make development easier.
🌐 Connect with me
There are 2 versions of this dataset:
git-diff_to_commit_msg - 1.5K rows huggingface link kaggle link
git-diff_to_commit_msg_large - 1.75M rows huggingface link kaggle link… See the full description on the dataset page: https://huggingface.co/datasets/seniruk/git-diff_to_commit_msg.
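A minimal loading sketch with the Hugging Face datasets library (the split name "train" is an assumption; check the dataset page for the available splits and fields):

# Sketch: load the dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("seniruk/git-diff_to_commit_msg", split="train")
print(ds[0])  # expected fields: a git diff and its commit message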
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Impact assessment is an evolving area of research that aims at measuring and predicting the potential effects of projects or programs. Measuring the impact of scientific research is a vibrant subdomain, closely intertwined with impact assessment. A recurring obstacle pertains to the absence of an efficient framework which can facilitate the analysis of lengthy reports and text labeling. To address this issue, we propose a framework for automatically assessing the impact of scientific research projects by identifying pertinent sections in project reports that indicate the potential impacts. We leverage a mixed-method approach, combining manual annotations with supervised machine learning, to extract these passages from project reports. This is a repository to save datasets and codes related to this project. Please read and cite the following paper if you would like to use the data:
Becker M., Han K., Werthmann A., Rezapour R., Lee H., Diesner J., and Witt A. (2024). Detecting Impact Relevant Sections in Scientific Research. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING).
This folder contains the following files:
evaluation_20220927.ods: Annotated German passages (Artificial Intelligence, Linguistics, and Music) - training data
annotated_data.big_set.corrected.txt: Annotated German passages (Mobility) - training data
incl_translation_all.csv: Annotated English passages (Artificial Intelligence, Linguistics, and Music) - training data
incl_translation_mobility.csv: Annotated German passages (Mobility) - training data
ttparagraph_addmob.txt: German corpus (unannotated passages)
model_result_extraction.csv: Extracted impact-relevant passages from the German corpus based on the model we trained
rf_model.joblib: The random forest model we trained to extract impact-relevant passages
Data processing codes can be found at: https://github.com/khan1792/texttransfer