Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Dataset of Bot and Human Activities in GitHub
This repository provides an updated version of a dataset of GitHub contributor activities that accompanies a paper published at MSR 2023 in the Data and Tool Showcase Track. The paper is entitled A Dataset of Bot and Human Activities in GitHub and is co-authored by Natarajan Chidambaram, Alexandre Decan and Tom Mens (Software Engineering Lab, University of Mons, Belgium). DOI: https://www.doi.org/10.1109/MSR59073.2023.00070. This work was carried out as part of Natarajan Chidambaram's PhD research in the context of the DigitalWallonia4.AI research project ARIAC (grant number 2010235) and TRAIL.
The dataset contains 1,015,422 high-level activities made by 350 bots and 620 human contributors on GitHub between 25 November 2022 and 15 April 2023. The activities were generated from 1,221,907 low-level events obtained from the GitHub Events API and cover 24 distinct activity types. The dataset facilitates the characterisation of bot and human behaviour in GitHub repositories by enabling the analysis of activity sequences and activity patterns of bot and human contributors. It could lead to better bot identification tools and to empirical studies on the role bots play in collaborative software development.
Files description
The following files are provided as part of the archive:
Example
Below is an example of a Closing pull request activity:
{
  "date": "2022-11-25T18:49:09+00:00",
  "activity": "Closing pull request",
  "contributor": "typescript-bot",
  "repository": "DefinitelyTyped/DefinitelyTyped",
  "comment": {
    "length": 249,
    "GH_node": "IC_kwDOAFz6BM5PJG7l"
  },
  "pull_request": {
    "id": 62328,
    "title": "[qunit] Add `test.each()`",
    "created_at": "2022-09-19T17:34:28+00:00",
    "status": "closed",
    "closed_at": "2022-11-25T18:49:08+00:00",
    "merged": false,
    "GH_node": "PR_kwDOAFz6BM4_N5ib"
  },
  "conversation": {
    "comments": 19
  },
  "payload": {
    "pr_commits": 1,
    "pr_changed_files": 5
  }
}
List of activity types
In total, we have identified 24 different high-level activity types from 15 different low-level event types. They are Creating repository, Creating branch, Creating tag, Deleting tag, Deleting repository, Publishing a release, Making repository public, Adding collaborator to repository, Forking repository, Starring repository, Editing wiki page, Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request, Commenting pull request, Commenting pull request changes, Reviewing code, Commenting commits, Pushing commits.
List of fields
Not only does the dataset contain a list of activities made by bot and human contributors, but it also contains some details about these activities. For example, commenting issue activities provide details about the author of the comment, the repository and issue in which the comment was created, and so on.
For all activity types, we provide the date of the activity, the contributor that made the activity, and the repository in which the activity took place. Depending on the activity type, additional fields are provided. In this section, we describe, for each activity type, the different fields that are provided in the JSON file. It is worth mentioning that we also provide the corresponding JSON schema alongside the dataset.
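As a quick illustration, the following sketch validates activity records against the provided schema using Python's jsonschema library; the file names activities.json and activities-schema.json are placeholders, and the assumption that the file holds a JSON array of activities should be checked against the actual archive.

import json

from jsonschema import validate

# File names are illustrative; substitute the actual dataset and schema files.
with open("activities-schema.json", encoding="utf-8") as f:
    schema = json.load(f)
with open("activities.json", encoding="utf-8") as f:
    activities = json.load(f)

# validate() raises a ValidationError on the first non-conforming record.
for record in activities:
    validate(instance=record, schema=schema)
print(len(activities), "records conform to the schema")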
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The source code for the National Composite Gazetteer Linked Data API is located at: https://github.com/GeoscienceAustralia/placenames-dataset. A live API instance is available at: http://linked.data.gov.au/dataset/placenames
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Using GitHub APIs, we construct an unbiased dataset of over 10 million GitHub users. The data was collected between Jul. 20 and Aug. 27, 2018, covering 10,000 users. Each data entry is stored in JSON format, representing one GitHub user, and containing the descriptive information from the user's profile page as well as information about their commit activities and created/forked public repositories.
We provide a sample of the dataset in 'Github_dataset_sample.json'. If you are interested in using the full dataset, please contact chenyang AT fudan.edu.cn to obtain it; the full dataset is available for research purposes only.
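A minimal sketch for inspecting the sample, assuming one JSON object per line (adjust the parsing if the file turns out to be a single JSON array):

import json

# Each entry represents one GitHub user profile with commit/repository info.
with open("Github_dataset_sample.json", encoding="utf-8") as f:
    users = [json.loads(line) for line in f if line.strip()]

print(len(users), "users loaded")
print(sorted(users[0].keys()))  # inspect the available profile fields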
Please cite the following paper when using the dataset: Qingyuan Gong, Yushan Liu, Jiayun Zhang, Yang Chen, Qi Li, Yu Xiao, Xin Wang, Pan Hui. Detecting Malicious Accounts in Online Developer Communities Using Deep Learning. To appear: IEEE Transactions on Knowledge and Data Engineering.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.
Contents:
This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.
The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.
repositories.csv:
programs.csv:
testing-files.csv:
scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
The script uses the GitHub code search API and inherits its limitations:
More details: https://docs.github.com/en/search-github/searching-on-github/searching-code
The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api
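As a rough illustration of the kind of query the script issues, here is a hedged sketch against the REST code search endpoint; the GITHUB_TOKEN environment variable is an assumption about local setup, and this is not the project's search-repositories.py itself.

import os

import requests

# Find repositories with a Pulumi.yaml project file in their root folder.
resp = requests.get(
    "https://api.github.com/search/code",
    params={"q": "filename:Pulumi.yaml path:/", "per_page": 100},
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
resp.raise_for_status()
repos = {item["repository"]["full_name"] for item in resp.json()["items"]}
print(len(repos), "candidate repositories on this results page")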
download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
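The corresponding clone could look like the following sketch; the repository URL and target folder are illustrative, not taken from the dataset.

import subprocess

# Shallow recursive copy of HEAD: the default branch's most recent state,
# including submodules, without the rest of the git history.
subprocess.run(
    [
        "git", "clone",
        "--depth", "1",                 # no history
        "--recurse-submodules",         # include submodules
        "--shallow-submodules",         # keep submodules shallow as well
        "https://github.com/example/iac-project",  # illustrative URL
        "downloads/123456",             # subfolder named by repository ID
    ],
    check=True,
)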
https://choosealicense.com/licenses/other/
GitHub Code Dataset
Dataset Description
The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.
How to use it
The GitHub Code dataset is very large, so for most use cases it is recommended to use the streaming API of datasets; a sketch is shown below. See the full description on the dataset page: https://huggingface.co/datasets/macrocosm-os/code-parrot-github-code.
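A minimal streaming sketch with the Hugging Face datasets library; the field names in the final comment are assumptions based on similar code corpora, so inspect the first sample to confirm them.

from datasets import load_dataset

# Streaming avoids downloading the full ~1TB dataset up front.
ds = load_dataset(
    "macrocosm-os/code-parrot-github-code",
    split="train",
    streaming=True,
)

first = next(iter(ds))
print(first.keys())  # e.g. code, repo_name, path, language (assumed fields)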
The data.gov catalog is powered by CKAN, a powerful open source data platform that includes a robust API. Please be aware that data.gov and the data.gov CKAN API only contain metadata about datasets. This metadata includes URLs and descriptions of datasets, but it does not include the actual data within each dataset.
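For illustration, a small sketch against the catalog's CKAN Action API; package_search is a standard CKAN action, and the query term here is arbitrary.

import requests

# Search data.gov's CKAN catalog; only dataset *metadata* is returned.
resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "github", "rows": 5},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["result"]["results"]:
    print(result["title"])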
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset on commits (and repositories) on GitHub making reference to data privacy legislation (covering laws: GDPR, CCPA, CPRA, UK DPA).
The dataset contains:
+ all_commits_info_merged-v2-SHA.csv: commit information as collected from various GitHub REST API calls (all data merged together).
+ repos_info_merged_USED-v2_with_loc.csv: repository information with some calculated data.
+ top-70-repos-commits-for-manual-check_commits-2coders.xlsx: results of the manual coding of the commits of the 70 most popular repositories in the dataset.
+ user-rights-Ο3.csv: different terms used for user rights in the legislation.
+ github_commits_analysis_replication.r: main analysis pipeline covering all RQs, in the R programming language.
The initial data collection can also be performed using the GitHub REST API, collecting data over time intervals, for instance:
https://api.github.com/search/commits?q=%22GDPR%22+committer-date:2018-05-25..2018-05-30&sort=committer-date&order=asc&per_page=100&page=1
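A sketch of that interval-based collection in Python; the five-day window and the GDPR query mirror the example URL above, while the GITHUB_TOKEN handling is an assumption (authenticated requests get far better rate limits).

import datetime as dt
import os

import requests

HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",  # assumed setup
    "Accept": "application/vnd.github+json",
}

start = dt.date(2018, 5, 25)
for _ in range(6):  # a few consecutive five-day windows, as an example
    end = start + dt.timedelta(days=5)
    resp = requests.get(
        "https://api.github.com/search/commits",
        params={
            "q": f'"GDPR" committer-date:{start}..{end}',
            "sort": "committer-date",
            "order": "asc",
            "per_page": 100,
        },
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    print(start, end, resp.json()["total_count"])
    start = end + dt.timedelta(days=1)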
This dataset accompanies the following publication, so please cite it accordingly:
Georgia M. Kapitsaki, Maria Papoutsoglou, Evolution of repositories and privacy laws: commit activities in the GDPR and CCPA era, accepted for publication at Elsevier Journal of Systems & Software, 2025.
Description
An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on location, starred repositories, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether a GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.
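A sketch of how such a graph could be loaded for the node classification task; the file names edges.csv and targets.csv and the column names are hypothetical stand-ins for the actual distribution files.

import csv

import networkx as nx

# Build the mutual-follower graph (assumed layout: one id_1,id_2 pair per row).
g = nx.Graph()
with open("edges.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    g.add_edges_from((u, v) for u, v in reader)

# Attach the binary target: web (0) vs. machine learning (1) developer.
with open("targets.csv", newline="") as f:
    labels = {row["id"]: int(row["ml_target"]) for row in csv.DictReader(f)}

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges,",
      sum(labels.values()), "ML developers")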
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current dataset contains Tweet IDs for tweets mentioning "COVID" (e.g., COVID-19, COVID19) and shared between March and July of 2020.
Sampling Method: hourly requests sent to the Twitter Search API using Social Feed Manager, an open source software that harvests social media data and related content from Twitter and other platforms.
NOTE:
1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset.
2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or the Python library Twarc (https://github.com/DocNow/twarc); see the sketch after these notes.
3) This dataset, like most datasets collected via the Twitter Search API, is a sample of the available tweets on this topic and is not meant to be comprehensive. Some COVID-related tweets might not be included in the dataset either because the tweets were collected using a standardized but intermittent (hourly) sampling protocol or because the tweets used hashtags/keywords other than COVID (e.g., Coronavirus or #nCoV).
4) To broaden this sample, consider comparing/merging this dataset with other COVID-19 related public datasets such as: https://github.com/thepanacealab/covid19_twitter, https://ieee-dataport.org/open-access/corona-virus-covid-19-tweets-dataset, https://github.com/echen102/COVID-19-TweetIDs
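A minimal rehydration sketch with twarc (version 1); the API credentials and the tweet-ids.txt file (one Tweet ID per line) are assumptions about local setup.

from twarc import Twarc

# Twitter API credentials are placeholders; supply your own.
t = Twarc(
    consumer_key="...",
    consumer_secret="...",
    access_token="...",
    access_token_secret="...",
)

# hydrate() takes an iterable of Tweet IDs and yields full tweet objects.
with open("tweet-ids.txt") as ids:
    for tweet in t.hydrate(ids):
        print(tweet["id_str"], tweet["full_text"][:80])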
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
This dataset is sourced from the issues within the official datasets repository on GitHub. We collected the corresponding issues and their comments via the GitHub API, performed some light data cleaning, and finally, randomly selected 1,000 samples to create this dataset.
Dataset Details
Dataset Description
Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More… See the full description on the dataset page: https://huggingface.co/datasets/rachel521/github-issues-simple.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The OpenAIRE Graph is an Open Access dataset containing metadata about research products (literature, datasets, software, etc.) linked to other entities of the research ecosystem like organisations, project grants, and data sources.
The large size of the OpenAIRE Graph is a major impediment for beginners wanting to familiarise themselves with the underlying data model and explore its contents. Working with the Graph at its full size typically requires access to a huge distributed computing infrastructure, which is not easily accessible to everyone.
The OpenAIRE Beginner's Kit aims to address this issue. It consists of two components:
A subset of the OpenAIRE Graph composed of the research products published between 2022-12-28 and 2023-07-31, all the entities connected to them and the respective relationships. The subset is composed of the following parts:
publication.tar: metadata records about research literature (includes types of publications listed here)
dataset.tar: metadata records about research data (includes the subtypes listed here)
software.tar: metadata records about research software (includes the subtypes listed here)
otherresearchproduct.tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.tar: metadata records about data sources whose content is available in the OpenAIRE Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.
project.tar: metadata records about project grants.
relation.tar: metadata records about relations between entities in the graph.
communities_infrastructures.tar: metadata records about research communities and research infrastructures
Each file is a tar archive containing gz files, each with one JSON record per line. Each record is compliant with the schema available at http://doi.org/10.5281/zenodo.8238874.
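A sketch for streaming records out of one of those archives, taking publication.tar as the example; the same pattern applies to the other parts.

import gzip
import json
import tarfile

# Iterate over the gz members of the tar archive; each holds one JSON per line.
with tarfile.open("publication.tar") as tar:
    for member in tar:
        if not (member.isfile() and member.name.endswith(".gz")):
            continue
        with gzip.open(tar.extractfile(member), mode="rt", encoding="utf-8") as gz:
            for line in gz:
                record = json.loads(line)
                print(record.get("id"))  # field name assumed; see the schema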
The code to analyse the data. It is available on GitHub. Just download the archive, unzip/untar it, and follow the instructions in the README file (there is no need to clone the GitHub repository).
This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded

The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

How the files are organized

├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation).csv
│   ├── basic.csv
│   ├── contributor(citation).csv
│   ├── ...
│   └── topic_classification(citation).csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2022.10.02_17.11.19.zip
│   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
│   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
│   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
│   ├── Arca_Dados_2022.10.02_17.44.35.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
├── dataset_pids_from_most_known_dataverse_installations.csv
├── licenses_used_by_dataverse_installations.csv
└── metadatablocks_from_most_known_dataverse_installations.csv

This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
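As an illustration of the token handling described above, a hedged sketch that reads such a hostname/apikey CSV and calls each installation's Search API; the X-Dataverse-key header is Dataverse's standard way to pass an API token, while the CSV file name is assumed.

import csv

import requests

# Read the two-column CSV described above (file name is an assumption;
# hostname values are assumed to include the scheme, e.g. https://...).
with open("installations.csv", newline="") as f:
    installations = list(csv.DictReader(f))

for inst in installations:
    resp = requests.get(
        f"{inst['hostname']}/api/search",
        params={"q": "*", "type": "dataset", "per_page": 1},
        headers={"X-Dataverse-key": inst["apikey"]},  # omit if not required
        timeout=30,
    )
    resp.raise_for_status()
    print(inst["hostname"], resp.json()["data"]["total_count"])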
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the latest version of the GHTraffic project. The main aim is to model a variety of transaction sequences to reflect more complex service behaviour.
This version consists of a single edition collected from the google/guava repository.
The data generation process is quite similar to the original GHTraffic design, but it incorporates minor changes to the synthetic data generation: a random date after a resource has been successfully posted is used to make up the request and response for all of the HTTP methods, and a further subset of unsuccessful transactions is added by issuing requests before resource creation has succeeded. This results in a far more dynamic series of transactions on named resources.
Scripts used for dataset construction are accessible from the repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rapid and widespread increase of Android malware presents substantial obstacles to cybersecurity research. To support malware research, we present the MH-1M dataset, a thorough compilation of 1,340,515 APK samples. This dataset encompasses a wide range of diverse attributes and metadata, offering a comprehensive perspective. The use of the VirusTotal API guarantees precise assessment of threats by amalgamating various detection techniques. Our research indicates that MH-1M is a highly current dataset that provides valuable insights into the changing nature of malware.

MH-1M consists of 23,247 features that cover a wide range of application behaviour, from intents::accept to apicalls::landroid/window/splashscreenview.remove. The features fall into four primary categories:

Feature Type    Values
APICalls        22,394
Intents         407
OPCodes         232
Permissions     214

The dataset is stored efficiently, occupying 29.0 GB of memory, which showcases its substantial yet manageable size. It consists of 1,221,421 benign applications and 119,094 malware applications, ensuring a balanced representation for accurate malware detection and analysis.

The MH-1M repository also offers a wide variety of metadata from APKs, providing useful insight into the development of malicious software over a period of more than ten years. The Android features include a wide variety of metadata, including SHA256 hashes, file names, package names, compilation APIs, and various other details. This GitHub repository contains over 400GB of valuable data, making it the largest and most comprehensive dataset available for advancing research and development in Android malware detection.
This dataset provides comprehensive social media profile links discovered through real-time web search. It includes profiles from major social networks like Facebook, TikTok, Instagram, Twitter, LinkedIn, Youtube, Pinterest, Github and more. The data is gathered through intelligent search algorithms and pattern matching. Users can leverage this dataset for social media research, influencer discovery, social presence analysis, and social media marketing. The API enables efficient discovery of social profiles across multiple platforms. The dataset is delivered in a JSON format via REST API.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a collection of developer comments from GitHub issues, commits, and pull requests. We collected 88,640,237 developer comments from 17,378 repositories. In total, this dataset includes:
54,252,380 issue comments (from 13,458,208 issues)
979,642 commit comments (from 49,710,108 commits)
33,408,215 pull request comments (from 12,680,373 pull requests)
Warning: The uploaded dataset is compressed from 185GB down to 25.1GB.
Purpose
The purpose of this dataset (corpus) is to provide a large dataset of software developer comments (natural language) for research. We intend to use this data in our own research, but we hope it will be helpful for other researchers.
Collection Process
Full implementation details can be found in the following publication:
Benjamin S. Meyers. Human Error Assessment in Software Engineering. Rochester Institute of Technology. 2023.
Data was downloaded using GitHub's GraphQL API via requests made with Python's requests library. We targeted 17,491 repositories with the following criteria:
At least 850 stars.
Primary language in the Top 50 from the TIOBE Index and/or listed as "popular" in GitHub's advanced search. Note that we collected the list of languages on August 31, 2021.
Due to design decisions made by GitHub, we could only get a list of at most 1,000 repositories for each target language. Comments from 113 repositories could not be downloaded for various reasons (failing API queries, JSONDecoderErrors, etc.). Eight target languages had no repositories matching the above criteria.
After collection using the GraphQL API, data was written to CSV using Python's csv.writer class. We highly recommend using Python's csv.reader to parse these CSV files as no newlines have been removed from developer comments.
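A minimal parsing sketch following that recommendation; the file name is illustrative (see the naming scheme described below).

import csv

# Comment bodies keep their original newlines, so let csv.reader handle the
# quoting rather than splitting the file on line breaks yourself.
with open("Python_is.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)

print(header)
print(len(rows), "issue comments loaded")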
88_million_developer_comments.zip
This zip file contains 135 CSV files, 3 per language. CSV names are formatted <language>_<type>.csv, with <language> being the name of the primary language and <type> being one of co (commits), is (issues), or pr (pull requests).
Languages included are: ABAP, Assembly, C, C# (C-Sharp), C++ (C-PlusPlus), Clojure, COBOL, CoffeeScript, CSS, Dart, D, DM, Elixir, Fortran, F# (F-Sharp), Go, Groovy, HTML, Java, JavaScript, Julia, Kotlin, Lisp, Lua, MATLAB, Nim, Objective-C, Pascal, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Rust, Scala, Scheme, Scratch, Shell, Swift, TSQL, TypeScript, VBScript, and VHDL.
Details on the columns in each CSV file are described in the provided README.md.
Detailed_Breakdown.ods
This spreadsheet contains specific details on how many repositories, commits, issues, pull requests, and comments are included in 88_million_developer_comments.zip.
Note On Completeness
We make no guarantee that every commit, issue, and/or pull request for each repository is included in this dataset. Due to the nature of the GraphQL API and data decoding difficulties, sometimes a query failed and that data is not included here.
Versioning
v1.1: The original corpus had duplicate header rows in the CSV files. This has been fixed.
v1.0: Original corpus.
Contact
Please contact Benjamin S. Meyers (email) with questions about this data and its collection.
Acknowledgments
Collection of this data has been sponsored in part by the National Science Foundation grant 1922169, and by a Department of Defense DARPA SBIR program (grant 140D63-19-C-0018).
This data was collected using the compute resources from the Research Computing department at the Rochester Institute of Technology. doi:10.34788/0S3G-QD15
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The content presented in this repository accompanies the paper "Breaking Bad? Semantic Versioning and Impact of Breaking Changes in Maven Central" authored by Lina Ochoa, Thomas Degueule, Jean-Rémy Falleri, and Jurgen Vinju. The paper was accepted in the Journal of Empirical Software Engineering (EMSE'21). This study is an external and differentiated replication of the paper "Semantic Versioning and Impact of Breaking Changes in the Maven Repository" by Steven Raemaekers, Arie van Deursen, and Joost Visser.
Content
README.md: document with the main description to start exploring the bundle.
data.zip: contains the datasets used within the study. These datasets must be used to obtain the same results as the ones presented in the article.
maven-api-dataset.zip: contains the code used to generate the datasets and to analyse the obtained results. Check the README.md file within this bundle for more information.
Relevant Links
maven-api-dataset repository: https://github.com/tdegueul/maven-api-dataset
maracas repository: https://github.com/crossminer/maracas
Companion webpage: https://crossminer.github.io/maracas/2021/08/16/emse21/
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AOL Dataset for Browsing History and Topics of Interest
This record provides the datasets of the paper The Privacy-Utility Trade-off in the Topics API.
The dataset-generating code and the experimental results can be found at 10.5281/zenodo.11032231 (github.com/nunesgh/topics-api-analysis).
License
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
I am writing articles on League of Legends and Machine Learning. You can find the full repository where this information is stored here.
https://creativecommons.org/publicdomain/zero/1.0/
This folder contains data behind the story A Statistical Analysis of the Work of Bob Ross.
This is a dataset from FiveThirtyEight hosted on their GitHub. Explore FiveThirtyEight data using Kaggle and all of the data sources available through the FiveThirtyEight organization page!
This dataset is maintained using GitHub's API and Kaggle's API.
This dataset is distributed under the Attribution 4.0 International (CC BY 4.0) license.
Cover photo by Alex Kotomanov on Unsplash
Unsplash Images are distributed under a unique Unsplash License.