Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Dataset of Bot and Human Activities in GitHub
This repository provides an updated version of a dataset of GitHub contributor activities that accompanies a paper published at MSR 2023 in the Data and Tool Showcase Track. The paper is entitled A Dataset of Bot and Human Activities in GitHub and is co-authored by Natarajan Chidambaram, Alexandre Decan and Tom Mens (Software Engineering Lab, University of Mons, Belgium). DOI: https://www.doi.org/10.1109/MSR59073.2023.00070. This work was carried out as part of Natarajan Chidambaram's PhD research in the context of the DigitalWallonia4.AI research project ARIAC (grant number 2010235) and TRAIL.
The dataset contains 1,015,422 high-level activities made by 350 bots and 620 human contributors on GitHub between 25 November 2022 and 15 April 2023. The activities were generated from 1,221,907 low-level events obtained from the GitHub Events API and cover 24 distinct activity types. The dataset facilitates the characterisation of bot and human behaviour in GitHub repositories by enabling the analysis of activity sequences and activity patterns of bot and human contributors. It could lead to better bot identification tools and to empirical studies on the role bots play in collaborative software development.
Files description
The following files are provided as part of the archive:
Example
Below is an example of a Closing pull request activity:
{
  "date": "2022-11-25T18:49:09+00:00",
  "activity": "Closing pull request",
  "contributor": "typescript-bot",
  "repository": "DefinitelyTyped/DefinitelyTyped",
  "comment": {
    "length": 249,
    "GH_node": "IC_kwDOAFz6BM5PJG7l"
  },
  "pull_request": {
    "id": 62328,
    "title": "[qunit] Add `test.each()`",
    "created_at": "2022-09-19T17:34:28+00:00",
    "status": "closed",
    "closed_at": "2022-11-25T18:49:08+00:00",
    "merged": false,
    "GH_node": "PR_kwDOAFz6BM4_N5ib"
  },
  "conversation": {
    "comments": 19
  },
  "payload": {
    "pr_commits": 1,
    "pr_changed_files": 5
  }
}
List of activity types
In total, we have identified 24 different high-level activity types from 15 different low-level event types. They are Creating repository, Creating branch, Creating tag, Deleting tag, Deleting repository, Publishing a release, Making repository public, Adding collaborator to repository, Forking repository, Starring repository, Editing wiki page, Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request, Commenting pull request, Commenting pull request changes, Reviewing code, Commenting commits, Pushing commits.
List of fields
Not only does the dataset contain a list of activities made by bot and human contributors, but it also contains some details about these activities. For example, commenting issue activities provide details about the author of the comment, the repository and issue in which the comment was created, and so on.
For all activity types, we provide the date of the activity, the contributor that made the activity, and the repository in which the activity took place. Depending on the activity type, additional fields are provided. In this section, we describe, for each activity type, the different fields that are provided in the JSON file. It is worth mentioning that we also provide the corresponding JSON schema alongside the dataset.
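As a quick illustration, the following sketch validates activity records against the provided schema using Python's jsonschema library; the file names activities.json and activities-schema.json are placeholders, and the assumption that the file holds a JSON array of activities should be checked against the actual archive.

import json

from jsonschema import validate

# File names are illustrative; substitute the actual dataset and schema files.
with open("activities-schema.json", encoding="utf-8") as f:
    schema = json.load(f)
with open("activities.json", encoding="utf-8") as f:
    activities = json.load(f)

# validate() raises a ValidationError on the first non-conforming record.
for record in activities:
    validate(instance=record, schema=schema)
print(len(activities), "records conform to the schema")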
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The source code for the National Composite Gazetteer Linked Data API is located at: https://github.com/GeoscienceAustralia/placenames-dataset. A live API instance is available at: http://linked.data.gov.au/dataset/placenames
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Using GitHub APIs, we construct an unbiased dataset of over 10 million GitHub users. The data was collected between Jul. 20 and Aug. 27, 2018, covering 10,000 users. Each data entry is stored in JSON format, representing one GitHub user, and containing the descriptive information from the user's profile page as well as information about their commit activities and created/forked public repositories.
We provide a sample of the dataset in 'Github_dataset_sample.json'. If you are interested in using the full dataset, please contact chenyang AT fudan.edu.cn to obtain it; the full dataset is available for research purposes only.
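A minimal sketch for inspecting the sample, assuming one JSON object per line (adjust the parsing if the file turns out to be a single JSON array):

import json

# Each entry represents one GitHub user profile with commit/repository info.
with open("Github_dataset_sample.json", encoding="utf-8") as f:
    users = [json.loads(line) for line in f if line.strip()]

print(len(users), "users loaded")
print(sorted(users[0].keys()))  # inspect the available profile fields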
Please cite the following paper when using the dataset: Qingyuan Gong, Yushan Liu, Jiayun Zhang, Yang Chen, Qi Li, Yu Xiao, Xin Wang, Pan Hui. Detecting Malicious Accounts in Online Developer Communities Using Deep Learning. To appear: IEEE Transactions on Knowledge and Data Engineering.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.
Contents:
This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.
The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.
repositories.csv:
programs.csv:
testing-files.csv:
scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
The script uses the GitHub code search API and inherits its limitations:
More details: https://docs.github.com/en/search-github/searching-on-github/searching-code
The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api
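As a rough illustration of the kind of query the script issues, here is a hedged sketch against the REST code search endpoint; the GITHUB_TOKEN environment variable is an assumption about local setup, and this is not the project's search-repositories.py itself.

import os

import requests

# Find repositories with a Pulumi.yaml project file in their root folder.
resp = requests.get(
    "https://api.github.com/search/code",
    params={"q": "filename:Pulumi.yaml path:/", "per_page": 100},
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
resp.raise_for_status()
repos = {item["repository"]["full_name"] for item in resp.json()["items"]}
print(len(repos), "candidate repositories on this results page")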
download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
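The corresponding clone could look like the following sketch; the repository URL and target folder are illustrative, not taken from the dataset.

import subprocess

# Shallow recursive copy of HEAD: the default branch's most recent state,
# including submodules, without the rest of the git history.
subprocess.run(
    [
        "git", "clone",
        "--depth", "1",                 # no history
        "--recurse-submodules",         # include submodules
        "--shallow-submodules",         # keep submodules shallow as well
        "https://github.com/example/iac-project",  # illustrative URL
        "downloads/123456",             # subfolder named by repository ID
    ],
    check=True,
)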
https://choosealicense.com/licenses/other/
GitHub Code Dataset
Dataset Description
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.
How to use it
The GitHub Code dataset is very large, so for most use cases it is recommended to use the streaming API of datasets; a sketch is shown below. See the full description on the dataset page: https://huggingface.co/datasets/macrocosm-os/code-parrot-github-code.
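A minimal streaming sketch with the Hugging Face datasets library; the field names in the final comment are assumptions based on similar code corpora, so inspect the first sample to confirm them.

from datasets import load_dataset

# Streaming avoids downloading the full ~1TB dataset up front.
ds = load_dataset(
    "macrocosm-os/code-parrot-github-code",
    split="train",
    streaming=True,
)

first = next(iter(ds))
print(first.keys())  # e.g. code, repo_name, path, language (assumed fields)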
The data.gov catalog is powered by CKAN, a powerful open source data platform that includes a robust API. Please be aware that data.gov and the data.gov CKAN API only contain metadata about datasets. This metadata includes URLs and descriptions of datasets, but it does not include the actual data within each dataset.
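For illustration, a small sketch against the catalog's CKAN Action API; package_search is a standard CKAN action, and the query term here is arbitrary.

import requests

# Search data.gov's CKAN catalog; only dataset *metadata* is returned.
resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "github", "rows": 5},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["result"]["results"]:
    print(result["title"])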
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset on commits (and repositories) on GitHub making reference to data privacy legislation (covering laws: GDPR, CCPA, CPRA, UK DPA).
The dataset contains:
+ all_commits_info_merged-v2-SHA.csv: commit information as collected from various GitHub REST API calls (all data merged together).
+ repos_info_merged_USED-v2_with_loc.csv: repository information with some calculated data.
+ top-70-repos-commits-for-manual-check_commits-2coders.xlsx: results of the manual coding of the commits of the 70 most popular repositories in the dataset.
+ user-rights-Ο3.csv: different terms used for user rights in the legislation.
+ github_commits_analysis_replication.r: main analysis pipeline covering all RQs, in the R programming language.
The initial data collection can also be performed using the GitHub REST API, collecting data over time intervals, for instance:
https://api.github.com/search/commits?q=%22GDPR%22+committer-date:2018-05-25..2018-05-30&sort=committer-date&order=asc&per_page=100&page=1
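A sketch of that interval-based collection in Python; the five-day window and the GDPR query mirror the example URL above, while the GITHUB_TOKEN handling is an assumption (authenticated requests get far better rate limits).

import datetime as dt
import os

import requests

HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",  # assumed setup
    "Accept": "application/vnd.github+json",
}

start = dt.date(2018, 5, 25)
for _ in range(6):  # a few consecutive five-day windows, as an example
    end = start + dt.timedelta(days=5)
    resp = requests.get(
        "https://api.github.com/search/commits",
        params={
            "q": f'"GDPR" committer-date:{start}..{end}',
            "sort": "committer-date",
            "order": "asc",
            "per_page": 100,
        },
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    print(start, end, resp.json()["total_count"])
    start = end + dt.timedelta(days=1)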
This dataset accompanies the following publication, so please cite it accordingly:
Georgia M. Kapitsaki, Maria Papoutsoglou, Evolution of repositories and privacy laws: commit activities in the GDPR and CCPA era, accepted for publication at Elsevier Journal of Systems & Software, 2025.
Description
An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on location, starred repositories, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether a GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.
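A sketch of how such a graph could be loaded for the node classification task; the file names edges.csv and targets.csv and the column names are hypothetical stand-ins for the actual distribution files.

import csv

import networkx as nx

# Build the mutual-follower graph (assumed layout: one id_1,id_2 pair per row).
g = nx.Graph()
with open("edges.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    g.add_edges_from((u, v) for u, v in reader)

# Attach the binary target: web (0) vs. machine learning (1) developer.
with open("targets.csv", newline="") as f:
    labels = {row["id"]: int(row["ml_target"]) for row in csv.DictReader(f)}

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges,",
      sum(labels.values()), "ML developers")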
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current dataset contains Tweet IDs for tweets mentioning "COVID" (e.g., COVID-19, COVID19) and shared between March and July of 2020.
Sampling Method: hourly requests sent to the Twitter Search API using Social Feed Manager, an open source software that harvests social media data and related content from Twitter and other platforms.
NOTE:
1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset.
2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or the Python library Twarc (https://github.com/DocNow/twarc); see the sketch after these notes.
3) This dataset, like most datasets collected via the Twitter Search API, is a sample of the available tweets on this topic and is not meant to be comprehensive. Some COVID-related tweets might not be included in the dataset either because the tweets were collected using a standardized but intermittent (hourly) sampling protocol or because the tweets used hashtags/keywords other than COVID (e.g., Coronavirus or #nCoV).
4) To broaden this sample, consider comparing/merging this dataset with other COVID-19 related public datasets such as: https://github.com/thepanacealab/covid19_twitter, https://ieee-dataport.org/open-access/corona-virus-covid-19-tweets-dataset, https://github.com/echen102/COVID-19-TweetIDs
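A minimal rehydration sketch with twarc (version 1); the API credentials and the tweet-ids.txt file (one Tweet ID per line) are assumptions about local setup.

from twarc import Twarc

# Twitter API credentials are placeholders; supply your own.
t = Twarc(
    consumer_key="...",
    consumer_secret="...",
    access_token="...",
    access_token_secret="...",
)

# hydrate() takes an iterable of Tweet IDs and yields full tweet objects.
with open("tweet-ids.txt") as ids:
    for tweet in t.hydrate(ids):
        print(tweet["id_str"], tweet["full_text"][:80])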
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
This dataset is sourced from the issues within the official datasets repository on GitHub. We collected the corresponding issues and their comments via the GitHub API, performed some light data cleaning, and finally, randomly selected 1,000 samples to create this dataset.
Dataset Details
Dataset Description
Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More… See the full description on the dataset page: https://huggingface.co/datasets/rachel521/github-issues-simple.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The OpenAIRE Graph is an Open Access dataset containing metadata about research products (literature, datasets, software, etc.) linked to other entities of the research ecosystem like organisations, project grants, and data sources.
The large size of the OpenAIRE Graph is a major impediment for beginners wanting to familiarise themselves with the underlying data model and explore its contents. Working with the Graph at its full size typically requires access to a huge distributed computing infrastructure, which is not easily accessible to everyone.
The OpenAIRE Beginner's Kit aims to address this issue. It consists of two components:
A subset of the OpenAIRE Graph composed of the research products published between 2022-12-28 and 2023-07-31, all the entities connected to them and the respective relationships. The subset is composed of the following parts:
publication.tar: metadata records about research literature (includes types of publications listed here)
dataset.tar: metadata records about research data (includes the subtypes listed here)
software.tar: metadata records about research software (includes the subtypes listed here)
otherresearchproduct.tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.tar: metadata records about data sources whose content is available in the OpenAIRE Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.
project.tar: metadata records about project grants.
relation.tar: metadata records about relations between entities in the graph.
communities_infrastructures.tar: metadata records about research communities and research infrastructures
Each file is a tar archive containing gz files, each with one JSON record per line. Each record is compliant with the schema available at http://doi.org/10.5281/zenodo.8238874.
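A sketch for streaming records out of one of those archives, taking publication.tar as the example; the same pattern applies to the other parts.

import gzip
import json
import tarfile

# Iterate over the gz members of the tar archive; each holds one JSON per line.
with tarfile.open("publication.tar") as tar:
    for member in tar:
        if not (member.isfile() and member.name.endswith(".gz")):
            continue
        with gzip.open(tar.extractfile(member), mode="rt", encoding="utf-8") as gz:
            for line in gz:
                record = json.loads(line)
                print(record.get("id"))  # field name assumed; see the schema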
The code to analyse the data. It is available on GitHub. Just download the archive, unzip/untar it, and follow the instructions in the README file (there is no need to clone the GitHub repository).
This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded

The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

How the files are organized

├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation).csv
│   ├── basic.csv
│   ├── contributor(citation).csv
│   ├── ...
│   └── topic_classification(citation).csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2022.10.02_17.11.19.zip
│   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
│   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
│   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
│   ├── Arca_Dados_2022.10.02_17.44.35.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
├── dataset_pids_from_most_known_dataverse_installations.csv
├── licenses_used_by_dataverse_installations.csv
└── metadatablocks_from_most_known_dataverse_installations.csv

This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
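As an illustration of the token handling described above, a hedged sketch that reads such a hostname/apikey CSV and calls each installation's Search API; the X-Dataverse-key header is Dataverse's standard way to pass an API token, while the CSV file name is assumed.

import csv

import requests

# Read the two-column CSV described above (file name is an assumption;
# hostname values are assumed to include the scheme, e.g. https://...).
with open("installations.csv", newline="") as f:
    installations = list(csv.DictReader(f))

for inst in installations:
    resp = requests.get(
        f"{inst['hostname']}/api/search",
        params={"q": "*", "type": "dataset", "per_page": 1},
        headers={"X-Dataverse-key": inst["apikey"]},  # omit if not required
        timeout=30,
    )
    resp.raise_for_status()
    print(inst["hostname"], resp.json()["data"]["total_count"])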
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the latest version of the GHTraffic project. The main aim is to model a variety of transaction sequences to reflect more complex service behaviour.
This version consists of a single edition collected from the google/guava repository.
The data generation process is quite similar to the original GHTraffic design, but it incorporates minor changes to the synthetic data generation: a random date after a resource has been successfully posted is used to make up the request and response for all of the HTTP methods, and a further subset of unsuccessful transactions is added by issuing requests before resource creation has succeeded. This results in a far more dynamic series of transactions on named resources.
Scripts used for dataset construction are accessible from the repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rapid and widespread increase of Android malware presents substantial obstacles to cybersecurity research. To support malware research, we present the MH-1M dataset, a thorough compilation of 1,340,515 APK samples. This dataset encompasses a wide range of diverse attributes and metadata, offering a comprehensive perspective. The use of the VirusTotal API guarantees precise assessment of threats by amalgamating various detection techniques. Our research indicates that MH-1M is a highly current dataset that provides valuable insights into the changing nature of malware.

MH-1M consists of 23,247 features that cover a wide range of application behaviour, from intents::accept to apicalls::landroid/window/splashscreenview.remove. The features fall into four primary categories:

Feature Type    Values
APICalls        22,394
Intents         407
OPCodes         232
Permissions     214

The dataset is stored efficiently, occupying 29.0 GB of memory, which showcases its substantial yet manageable size. It consists of 1,221,421 benign applications and 119,094 malware applications, ensuring a balanced representation for accurate malware detection and analysis.

The MH-1M repository also offers a wide variety of metadata from APKs, providing useful insight into the development of malicious software over a period of more than ten years. The Android features include a wide variety of metadata, including SHA256 hashes, file names, package names, compilation APIs, and various other details. This GitHub repository contains over 400GB of valuable data, making it the largest and most comprehensive dataset available for advancing research and development in Android malware detection.
This dataset provides comprehensive social media profile links discovered through real-time web search. It includes profiles from major social networks like Facebook, TikTok, Instagram, Twitter, LinkedIn, Youtube, Pinterest, Github and more. The data is gathered through intelligent search algorithms and pattern matching. Users can leverage this dataset for social media research, influencer discovery, social presence analysis, and social media marketing. The API enables efficient discovery of social profiles across multiple platforms. The dataset is delivered in a JSON format via REST API.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a collection of developer comments from GitHub issues, commits, and pull requests. We collected 88,640,237 developer comments from 17,378 repositories. In total, this dataset includes:
54,252,380 issue comments (from 13,458,208 issues)
979,642 commit comments (from 49,710,108 commits)
33,408,215 pull request comments (from 12,680,373 pull requests)
Warning: The uploaded dataset is compressed from 185GB down to 25.1GB.
Purpose
The purpose of this dataset (corpus) is to provide a large dataset of software developer comments (natural language) for research. We intend to use this data in our own research, but we hope it will be helpful for other researchers.
Collection Process
Full implementation details can be found in the following publication:
Benjamin S. Meyers. Human Error Assessment in Software Engineering. Rochester Institute of Technology. 2023.
Data was downloaded using GitHub's GraphQL API via requests made with Python's requests library. We targeted 17,491 repositories with the following criteria:
At least 850 stars.
Primary language in the Top 50 from the TIOBE Index and/or listed as "popular" in GitHub's advanced search. Note that we collected the list of languages on August 31, 2021.
Due to design decisions made by GitHub, we could only get a list of at most 1,000 repositories for each target language. Comments from 113 repositories could not be downloaded for various reasons (failing API queries, JSONDecoderErrors, etc.). Eight target languages had no repositories matching the above criteria.
After collection using the GraphQL API, data was written to CSV using Python's csv.writer class. We highly recommend using Python's csv.reader to parse these CSV files as no newlines have been removed from developer comments.
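A minimal parsing sketch following that recommendation; the file name is illustrative (see the naming scheme described below).

import csv

# Comment bodies keep their original newlines, so let csv.reader handle the
# quoting rather than splitting the file on line breaks yourself.
with open("Python_is.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)

print(header)
print(len(rows), "issue comments loaded")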
88_million_developer_comments.zip
This zip file contains 135 CSV files, 3 per language. CSV names are formatted <language>_<type>.csv, with <language> being the name of the primary language and <type> being one of co (commits), is (issues), or pr (pull requests).
Languages included are: ABAP, Assembly, C, C# (C-Sharp), C++ (C-PlusPlus), Clojure, COBOL, CoffeeScript, CSS, Dart, D, DM, Elixir, Fortran, F# (F-Sharp), Go, Groovy, HTML, Java, JavaScript, Julia, Kotlin, Lisp, Lua, MATLAB, Nim, Objective-C, Pascal, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Rust, Scala, Scheme, Scratch, Shell, Swift, TSQL, TypeScript, VBScript, and VHDL.
Details on the columns in each CSV file are described in the provided README.md.
Detailed_Breakdown.ods
This spreadsheet contains specific details on how many repositories, commits, issues, pull requests, and comments are included in 88_million_developer_comments.zip.
Note On Completeness
We make no guarantee that every commit, issue, and/or pull request for each repository is included in this dataset. Due to the nature of the GraphQL API and data decoding difficulties, sometimes a query failed and that data is not included here.
Versioning
v1.1: The original corpus had duplicate header rows in the CSV files. This has been fixed.
v1.0: Original corpus.
Contact
Please contact Benjamin S. Meyers (email) with questions about this data and its collection.
Acknowledgments
Collection of this data has been sponsored in part by the National Science Foundation grant 1922169, and by a Department of Defense DARPA SBIR program (grant 140D63-19-C-0018).
This data was collected using the compute resources from the Research Computing department at the Rochester Institute of Technology. doi:10.34788/0S3G-QD15
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The content presented in this repository accompanies the paper "Breaking Bad? Semantic Versioning and Impact of Breaking Changes in Maven Central" authored by Lina Ochoa, Thomas Degueule, Jean-Rémy Falleri, and Jurgen Vinju. The paper was accepted in the Journal of Empirical Software Engineering (EMSE'21). This study is an external and differentiated replication of the paper "Semantic Versioning and Impact of Breaking Changes in the Maven Repository" by Steven Raemaekers, Arie van Deursen, and Joost Visser.
Content
README.md: document with the main description to start exploring the bundle.
data.zip: contains the datasets used within the study. These datasets must be used to obtain the same results as the ones presented in the article.
maven-api-dataset.zip: contains the code used to generate the datasets and to analyse the obtained results. Check the README.md file within this bundle for more information.
Relevant Links
maven-api-dataset repository: https://github.com/tdegueul/maven-api-dataset
maracas repository: https://github.com/crossminer/maracas
Companion webpage: https://crossminer.github.io/maracas/2021/08/16/emse21/
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AOL Dataset for Browsing History and Topics of Interest
This record provides the datasets of the paper The Privacy-Utility Trade-off in the Topics API.
The dataset-generating code and the experimental results can be found at 10.5281/zenodo.11032231 (github.com/nunesgh/topics-api-analysis).
License
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
I am writing articles on League of Legends and Machine Learning. You can find the full repository where this information is stored here.
https://creativecommons.org/publicdomain/zero/1.0/
This folder contains data behind the story A Statistical Analysis of the Work of Bob Ross.
This is a dataset from FiveThirtyEight hosted on their GitHub. Explore FiveThirtyEight data using Kaggle and all of the data sources available through the FiveThirtyEight organization page!
This dataset is maintained using GitHub's API and Kaggle's API.
This dataset is distributed under the Attribution 4.0 International (CC BY 4.0) license.
Cover photo by Alex Kotomanov on Unsplash
Unsplash Images are distributed under a unique Unsplash License.