100+ datasets found
  1. Data from: A Dataset of Bot and Human Activities in GitHub

    • zenodo.org
    json, txt
    Updated Jan 5, 2024
    Cite
    Natarajan Chidambaram; Alexandre Decan; Tom Mens (2024). A Dataset of Bot and Human Activities in GitHub [Dataset]. http://doi.org/10.5281/zenodo.8219470
    Explore at:
    Available download formats: json, txt
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Natarajan Chidambaram; Alexandre Decan; Tom Mens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Dataset of Bot and Human Activities in GitHub

    This repository provides an updated version of a dataset of GitHub contributor activities, accompanying a paper published at MSR 2023 in the Data and Tool Showcase Track. The paper is entitled A Dataset of Bot and Human Activities in GitHub and is co-authored by Natarajan Chidambaram, Alexandre Decan and Tom Mens (Software Engineering Lab, University of Mons, Belgium). DOI: https://www.doi.org/10.1109/MSR59073.2023.00070. This work was done as part of Natarajan Chidambaram's PhD research in the context of the DigitalWallonia4.AI research project ARIAC (grant number 2010235) and TRAIL.

    The dataset contains 1,015,422 high-level activities made by 350 bots and 620 human contributors on GitHub between 25 November 2022 and 15 April 2023. The activities were generated from 1,221,907 low-level events obtained from GitHub's Event API and cover 24 distinct activity types. This dataset facilitates the characterisation of bot and human behaviour in GitHub repositories by enabling the analysis of activity sequences and activity patterns of bot and human contributors. It could lead to better bot identification tools and to empirical studies of the role bots play in collaborative software development.

    Files description

    The following files are provided as part of the archive:

    • bot_activities.json - A JSON file containing 754,165 activities made by 350 bot contributors;
    • human_activities.json - A JSON file containing 261,258 activities made by 620 human contributors (anonymized);
    • JsonSchema.json - A JSON schema that validates the above datasets;
    • bots.txt - A text file containing the login names of all 350 bots.

    Example

    Below is an example of a Closing pull request activity:

    {
     "date": "2022-11-25T18:49:09+00:00",
     "activity": "Closing pull request",
     "contributor": "typescript-bot",
     "repository": "DefinitelyTyped/DefinitelyTyped",
     "comment": {
       "length": 249,
       "GH_node": "IC_kwDOAFz6BM5PJG7l"
     },
     "pull_request": {
       "id": 62328,
       "title": "[qunit] Add `test.each()`",
       "created_at": "2022-09-19T17:34:28+00:00",
       "status": "closed",
       "closed_at": "2022-11-25T18:49:08+00:00",
       "merged": false,
       "GH_node": "PR_kwDOAFz6BM4_N5ib"
     },
     "conversation": {
       "comments": 19
     },
     "payload": {
       "pr_commits": 1,
       "pr_changed_files": 5
     }
    }
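
    Building on the example above, here is a minimal Python sketch (assuming the file names listed earlier, that each JSON file holds a top-level list of activity objects, and the third-party jsonschema package) for loading the bot activities and validating them against the provided schema:

    import json
    from collections import Counter

    from jsonschema import validate  # pip install jsonschema

    # Load the schema and the bot activities (file names from the archive).
    with open("JsonSchema.json") as f:
        schema = json.load(f)
    with open("bot_activities.json") as f:
        activities = json.load(f)

    # Check that the activities conform to the published JSON schema.
    validate(instance=activities, schema=schema)

    # Tally the high-level activity types, e.g. "Closing pull request".
    print(Counter(a["activity"] for a in activities).most_common(5))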

    List of activity types

    In total, we have identified 24 different high-level activity types from 15 different low-level event types. They are Creating repository, Creating branch, Creating tag, Deleting tag, Deleting repository, Publishing a release, Making repository public, Adding collaborator to repository, Forking repository, Starring repository, Editing wiki page, Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request, Commenting pull request, Commenting pull request changes, Reviewing code, Commenting commits, Pushing commits.

    List of fields

    Not only does the dataset contain a list of activities made by bot and human contributors, but it also contains some details about these activities. For example, commenting issue activities provide details about the author of the comment, the repository and issue in which the comment was created, and so on.

    For all activity types, we provide the date of the activity, the contributor that made the activity, and the repository in which the activity took place. Depending on the activity type, additional fields are provided. In this section, we describe, for each activity type, the different fields provided in the JSON file. It is worth mentioning that we also provide the corresponding JSON schema alongside the datasets.

    Properties

    • date
      • Date on which the activity is performed
      • Type: string
      • e.g., "2022-11-25T09:55:19+00:00"
      • String format must be a "date-time"

    • activity
      • The activity performed by the contributor
      • Type: string
      • e.g., "Commenting pull request"
    • contributor
      • The login name of the contributor who performed this activity
      • Type: string
      • e.g., "analysis-bot", "anonymised" in the case of a human contributor
    • repository
      • The repository in which the activity is performed
      • Type: string
      • e.g., "apache/spark", "anonymised" in the case of a human contributor
    • issue
      • Issue information - provided for Opening issue, Closing issue, Reopening issue, Transferring issue and Commenting issue
      • Type: object
      • Properties
        • id
          • Issue number
          • Type: integer
          • e.g., 35471
        • title
          • Issue title
          • Type: string
          • e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
        • created_at
          • The date on which this issue is created
          • Type: string
          • e.g., "2022-11-10T13:07:23+00:00"
          • String format must be a "date-time"
        • status
          • Current state of the issue
          • Type: string
          • "open" or "closed"
        • closed_at
          • The date on which this issue is closed. "null" will be provided if the issue is open
          • Types: string, null
          • e.g., "2022-11-25T10:42:39+00:00"
          • String format must be a "date-time"
        • resolved
          • Whether the issue was closed as resolved (false if closed as not_planned or still open)
          • Type: boolean
          • true or false
        • GH_node
          • The GitHub node of this issue
          • Type: string
          • e.g., "IC_kwDOC27xRM5PHTBU", "anonymised" in the case of a human contributor
    • pull_request
      • Pull request information - provided for Opening pull request, Closing pull request, Reopening pull request, Commenting pull request changes and Reviewing code
      • Type: object
      • Properties
        • id
          • Pull request number
          • Type: integer
          • e.g., 35471
        • title
          • Pull request title
          • Type: string
          • e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
        • created_at
          • The date on which this pull request is created
          • Type: string
          • e.g., "2022-11-10T13:07:23+00:00"
          • String format must be a "date-time"
        • status
          • Current state of the pull request
          • Type: string
          • "open" or "closed"
        • closed_at
          • The date on which this pull request is closed. "null" will be provided if the pull request is open
          • Types: string, null
          • e.g., "2022-11-25T10:42:39+00:00"
          • String format must be a "date-time"
        • merged
          • Whether the PR was merged (false if rejected or still open)
          • Type: boolean
          • true or false
        • GH_node
          • The GitHub node of this pull request
          • Type: string
          • e.g., "PR_kwDOC7Q2kM5Dsu3-", "anonymised" in the case of a human contributor
    • review
      • Pull request review information - provided for Reviewing code
      • Type: object
      • Properties
        • status
          • Status of the review
          • Type: string
          • "changes_requested" or "approved" or "dismissed"
        • GH_node
          • The GitHub node of this review
          • Type: string
          • e.g., "PRR_kwDOEBHXU85HLfIn", "anonymised" in the case of a human contributor
    • conversation
      • Comments information in issue or pull request - Provided for Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request and Commenting pull request
      • Type: object
      • Properties
        • comments
          • Number of comments present in the corresponding issue or pull request
          • Type: integer
          • e.g., 19

  2. Placenames API GitHub code repository

    • data.gov.au
    html
    Updated Oct 1, 2020
    Cite
    Commonwealth of Australia (Geoscience Australia) (2020). Placenames API GitHub code repository [Dataset]. https://data.gov.au/dataset/ds-ga-251fc23d-c59c-4a1a-be04-2c25e4e7d45b
    Explore at:
    Available download formats: html
    Dataset updated
    Oct 1, 2020
    Dataset provided by
    Commonwealth of Australia (Geoscience Australia)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The source code for the National Composite Gazetteer Linked Data API, located at: https://github.com/GeoscienceAustralia/placenames-dataset, with a live API instance at: http://linked.data.gov.au/dataset/placenames

  3. A Representative User-centric GitHub Developers Dataset for Malicious Account Detection

    • figshare.com
    png
    Updated Dec 29, 2022
    + more versions
    Cite
    Yushan Liu (2022). A Representative User-centric GitHub Developers Dataset for Malicious Account Detection [Dataset]. http://doi.org/10.6084/m9.figshare.21789566.v1
    Explore at:
    Available download formats: png
    Dataset updated
    Dec 29, 2022
    Dataset provided by
    figshare
    Authors
    Yushan Liu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Using GitHub APIs, we construct an unbiased dataset of over 10 million GitHub users. The data was collected between Jul. 20 and Aug. 27, 2018, covering 10,000 users. Each data entry is stored in JSON format, representing one GitHub user and containing the descriptive information from the user's profile page, information about their commit activities, and their created/forked public repositories.

    We provide a sample of the dataset in 'Github_dataset_sample.json'. If you are interested in using the full dataset, please contact chenyang AT fudan.edu.cn to obtain it for research purposes only.

    Please cite the following paper when using the dataset: Qingyuan Gong, Yushan Liu, Jiayun Zhang, Yang Chen, Qi Li, Yu Xiao, Xin Wang, Pan Hui. Detecting Malicious Accounts in Online Developer Communities Using Deep Learning. To appear: IEEE Transactions on Knowledge and Data Engineering.

  4. PIPr: A Dataset of Public Infrastructure as Code Programs

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Nov 28, 2023
    Cite
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi (2023). PIPr: A Dataset of Public Infrastructure as Code Programs [Dataset]. http://doi.org/10.5281/zenodo.10173400
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Sokolowski; David Spielmann; Guido Salvaneschi
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.

    Contents:

    • metadata.zip: The dataset metadata and analysis results as CSV files.
    • scripts-and-logs.zip: Scripts and logs of the dataset creation.
    • LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
    • README.md: This document.
    • redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

    This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

    Metadata

    The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

    repositories.csv:

    • ID (integer): GitHub repository ID
    • url (string): GitHub repository URL
    • downloaded (boolean): Whether cloning the repository succeeded
    • name (string): Repository name
    • description (string): Repository description
    • licenses (string, list of strings): Repository licenses
    • redistributable (boolean): Whether the repository's licenses permit redistribution
    • created (string, date & time): Time of the repository's creation
    • updated (string, date & time): Time of the last update to the repository
    • pushed (string, date & time): Time of the last push to the repository
    • fork (boolean): Whether the repository is a fork
    • forks (integer): Number of forks
    • archive (boolean): Whether the repository is archived
    • programs (string, list of strings): Project file path of each IaC program in the repository
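
    As an illustration, a small pandas sketch over these columns (assuming pandas parses the boolean columns as True/False) to select the successfully cloned, redistributable repositories:

    import pandas as pd

    repos = pd.read_csv("repositories.csv")
    # Keep repositories that were cloned successfully and whose licenses
    # permit redistribution (boolean columns as described above).
    subset = repos[repos["downloaded"] & repos["redistributable"]]
    print(len(subset), "redistributable repositories")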

    programs.csv:

    • ID (string): Project file path of the IaC program
    • repository (integer): GitHub repository ID of the repository containing the IaC program
    • directory (string): Path of the directory containing the IaC program's project file
    • solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
    • language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
    • name (string): IaC program name
    • description (string): IaC program description
    • runtime (string): Runtime string of the IaC program
    • testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • tests (string, list of strings): File paths of IaC program's tests

    testing-files.csv:

    • file (string): Testing file path
    • language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
    • techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
    • keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
    • program (string): Project file path of the testing file's IaC program

    Dataset Creation

    scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:

    1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
    2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
    3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
    4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

    Searching Repositories

    The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:

    1. GitHub access token.
    2. Name of the CSV output file.
    3. Filename to search for.
    4. File extensions to search for, separated by commas.
    5. Min file size for the search (for all files: 0).
    6. Max file size for the search or * for unlimited (for all files: *).

    Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/

    AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html

    CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
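
    For instance, a hedged sketch of a Pulumi search invocation, where the token and the output file name are placeholders:

    import subprocess

    subprocess.run([
        "python", "search-repositories.py",
        "<GITHUB_ACCESS_TOKEN>",    # 1. GitHub access token (placeholder)
        "pulumi-repositories.csv",  # 2. CSV output file (hypothetical name)
        "Pulumi",                   # 3. filename to search for
        "yml,yaml",                 # 4. file extensions to search for
        "0",                        # 5. min file size (all files)
        "*",                        # 6. max file size (unlimited)
    ], check=True)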

    Limitations

    The script uses the GitHub code search API and inherits its limitations:

    • Only forks with more stars than the parent repository are included.
    • Only the repositories' default branches are considered.
    • Only files smaller than 384 KB are searchable.
    • Only repositories with fewer than 500,000 files are considered.
    • Only repositories that have had activity or have been returned in search results in the last year are considered.

    More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

    The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

    Downloading Repositories

    download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:

    1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
    2. Output directory to download the repositories to.
    3. Name of the CSV output file.

    The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
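
    Continuing the example above, a hypothetical invocation (all file names are placeholders):

    import subprocess

    subprocess.run([
        "python", "download-repositories.py",
        "pulumi-repositories.csv",  # 1. repositories CSV(s), comma-separated
        "repositories",             # 2. output directory for the downloads
        "downloads.csv",            # 3. overview CSV output file
    ], check=True)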

  5. code-parrot-github-code

    • huggingface.co
    Updated Mar 17, 2022
    + more versions
    Cite
    Macrocosmos (2022). code-parrot-github-code [Dataset]. https://huggingface.co/datasets/macrocosm-os/code-parrot-github-code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 17, 2022
    Dataset authored and provided by
    Macrocosmos
    License

    https://choosealicense.com/licenses/other/

    Description

    GitHub Code Dataset

      Dataset Description
    

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.

      How to use it
    

    The GitHub Code dataset is very large, so for most use cases it is recommended to make use of the streaming API of datasets. You can load and iterate through the dataset with the following… See the full description on the dataset page: https://huggingface.co/datasets/macrocosm-os/code-parrot-github-code.
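
    A minimal streaming sketch with the Hugging Face datasets library (the field names are not documented here, so the sketch just inspects the first sample):

    from datasets import load_dataset

    ds = load_dataset(
        "macrocosm-os/code-parrot-github-code",
        streaming=True,  # iterate without downloading the full 1TB
        split="train",
    )
    first = next(iter(ds))
    print(list(first.keys()))  # inspect the available fields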

  6. Data.gov CKAN API

    • catalog.data.gov
    • datasets.ai
    • +3more
    Updated Nov 10, 2020
    Cite
    Data.gov (2020). Data.gov CKAN API [Dataset]. https://catalog.data.gov/dataset/data-gov-ckan-api
    Explore at:
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Data.gov (https://data.gov/)
    Description

    The data.gov catalog is powered by CKAN, a powerful open source data platform that includes a robust API. Please be aware that data.gov and the data.gov CKAN API only contain metadata about datasets. This metadata includes URLs and descriptions of datasets, but it does not include the actual data within each dataset.
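
    As a brief illustration, CKAN's standard action API can be queried over plain HTTP; a sketch using the package_search action (the query term is chosen arbitrarily):

    import requests

    resp = requests.get(
        "https://catalog.data.gov/api/3/action/package_search",
        params={"q": "github", "rows": 5},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["result"]
    print(result["count"], "datasets matched")
    for pkg in result["results"]:
        print(pkg["name"])  # metadata only; not the data itself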

  7. GitHub data privacy commits from JSS 2025

    • zenodo.org
    Updated May 28, 2025
    Cite
    Georgia Kapitsaki; Maria Papoutsoglou (2025). GitHub data privacy commits from JSS 2025 [Dataset]. http://doi.org/10.5281/zenodo.15532947
    Explore at:
    Dataset updated
    May 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Georgia Kapitsaki; Maria Papoutsoglou
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset on commits (and repositories) on GitHub making reference to data privacy legislation (covering laws: GDPR, CCPA, CPRA, UK DPA).

    The dataset contains:
    + all_commits_info_merged-v2-SHA.csv : commits information as collected from various GitHub REST API calls (all data merged together).
    + repos_info_merged_USED-v2_with_loc.csv: repository information with some calculated data.
    + top-70-repos-commits-for-manual-check_commits-2coders.xlsx: results of the manual coding of the commits of the 70 most popular repositories in dataset.
    + user-rights-ω3.csv: different terms for user rights terminology in legislation.
    + github_commits_analysis_replication.r: main analysis pipeline covering all RQs in the R programming language.

    In order to also perform the initial data collection, the GitHub REST API can be used, collecting data over time intervals, for instance:
    https://api.github.com/search/commits?q=%22GDPR%22+committer-date:2018-05-25..2018-05-30&sort=committer-date&order=asc&per_page=100&page=1
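
    In Python, the same query could look like the following sketch (an unauthenticated call is heavily rate-limited; add a token header for real use):

    import requests

    resp = requests.get(
        "https://api.github.com/search/commits",
        params={
            "q": '"GDPR" committer-date:2018-05-25..2018-05-30',
            "sort": "committer-date",
            "order": "asc",
            "per_page": 100,
            "page": 1,
        },
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["total_count"], "matching commits")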

    This dataset accompanies the following publication, so please cite it accordingly:

    Georgia M. Kapitsaki, Maria Papoutsoglou, Evolution of repositories and privacy laws: commit activities in the GDPR and CCPA era, accepted for publication at Elsevier Journal of Systems & Software, 2025.

  8. GitHub Social Network

    • kaggle.com
    Updated Jan 12, 2023
    Cite
    Gitanjali Wadhwa (2023). GitHub Social Network [Dataset]. https://www.kaggle.com/datasets/gitanjali1425/github-social-network-graph-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gitanjali Wadhwa
    Description

    Description

    An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on location, starred repositories, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether a GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.

    Properties

    • Directed: No.
    • Node features: Yes.
    • Edge features: No.
    • Node labels: Yes. Binary-labeled.
    • Temporal: No.
    • Nodes: 37,700
    • Edges: 289,003
    • Density: 0.001
    • Transitivity: 0.013

    Possible Tasks

    • Binary node classification
    • Link prediction
    • Community detection
    • Network visualisation
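
    As an illustration, a networkx sketch that recomputes the density and transitivity listed above; the edge-list file name and column names are assumptions:

    import networkx as nx
    import pandas as pd

    edges = pd.read_csv("musae_git_edges.csv")  # hypothetical file name
    G = nx.from_pandas_edgelist(edges, "id_1", "id_2")  # assumed column names
    print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
    print(round(nx.density(G), 3), round(nx.transitivity(G), 3))
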
  9. COVID-19 Twitter Dataset

    • figshare.com
    • borealisdata.ca
    zip
    Updated Oct 2, 2021
    Cite
    Social Media Lab (2021). COVID-19 Twitter Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.16713448.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 2, 2021
    Dataset provided by
    figshare
    Authors
    Social Media Lab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The current dataset contains Tweet IDs for tweets mentioning "COVID" (e.g., COVID-19, COVID19) and shared between March and July of 2020.

    Sampling method: hourly requests sent to the Twitter Search API using Social Feed Manager, an open-source software package that harvests social media data and related content from Twitter and other platforms.

    NOTE:
    1) In accordance with Twitter API Terms, only Tweet IDs are provided as part of this dataset.
    2) To recollect tweets based on the list of Tweet IDs contained in these datasets, you will need to use tweet 'rehydration' programs like Hydrator (https://github.com/DocNow/hydrator) or the Python library Twarc (https://github.com/DocNow/twarc).
    3) This dataset, like most datasets collected via the Twitter Search API, is a sample of the available tweets on this topic and is not meant to be comprehensive. Some COVID-related tweets might not be included either because the tweets were collected using a standardized but intermittent (hourly) sampling protocol or because they used hashtags/keywords other than COVID (e.g., Coronavirus or #nCoV).
    4) To broaden this sample, consider comparing/merging this dataset with other COVID-19 related public datasets such as:
    https://github.com/thepanacealab/covid19_twitter
    https://ieee-dataport.org/open-access/corona-virus-covid-19-tweets-dataset
    https://github.com/echen102/COVID-19-TweetIDs
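
    For step 2, a minimal rehydration sketch with Twarc v1 is shown below; the credentials and the Tweet-ID file name are placeholders, and Twitter API access is assumed:

    from twarc import Twarc

    # Placeholder credentials; obtain real ones via a Twitter developer account.
    t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    with open("tweet_ids.txt") as ids:   # one Tweet ID per line
        for tweet in t.hydrate(ids):     # yields full tweet JSON objects
            print(tweet["id_str"], tweet.get("full_text", ""))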

  10. github-issues-simple

    • huggingface.co
    Cite
    Mikerachel521, github-issues-simple [Dataset]. https://huggingface.co/datasets/rachel521/github-issues-simple
    Explore at:
    Authors
    Mikerachel521
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset is sourced from the issues within the official datasets repository on GitHub. We collected the corresponding issues and their comments via the GitHub API, performed some light data cleaning, and finally, randomly selected 1,000 samples to create this dataset.
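
    A rough sketch of that collection step using the GitHub REST API; the repository is assumed to be huggingface/datasets, and pagination and authentication are omitted:

    import requests

    resp = requests.get(
        "https://api.github.com/repos/huggingface/datasets/issues",
        params={"state": "all", "per_page": 100, "page": 1},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    for issue in resp.json():
        # Each issue links to its comments via the "comments_url" field.
        print(issue["number"], issue["title"], issue["comments_url"])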

      Dataset Details

      Dataset Description

    Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More… See the full description on the dataset page: https://huggingface.co/datasets/rachel521/github-issues-simple.

  11. Data from: OpenAIRE Graph Beginner's Kit Dataset

    • zenodo.org
    • pub.uni-bielefeld.de
    tar
    Updated Aug 20, 2023
    + more versions
    Cite
    Miriam Baglioni; Claudio Atzori; Alessia Bardi; Gianbattista Bloisi; Sandro La Bruzzo; Paolo Manghi; Harry Dimitropoulos; Andrea Mannocci; Ioannis Foufoulas; Marek Horst; Michele De Bonis; Michele Artini; Thanasis Vergoulis; Serafeim Chatzopoulos; Dimitris Pierrakos; Antonis Lempesis; Andreas Czerniak; Jochen Schirrwagen; Alexandros Ioannidis; Katerina Iatropoulou; Argiro Kokogiannaki (2023). OpenAIRE Graph Beginner's Kit Dataset [Dataset]. http://doi.org/10.5281/zenodo.8223812
    Explore at:
    Available download formats: tar
    Dataset updated
    Aug 20, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Miriam Baglioni; Claudio Atzori; Alessia Bardi; Gianbattista Bloisi; Sandro La Bruzzo; Paolo Manghi; Harry Dimitropoulos; Andrea Mannocci; Ioannis Foufoulas; Marek Horst; Michele De Bonis; Michele Artini; Thanasis Vergoulis; Serafeim Chatzopoulos; Dimitris Pierrakos; Antonis Lempesis; Andreas Czerniak; Jochen Schirrwagen; Alexandros Ioannidis; Katerina Iatropoulou; Argiro Kokogiannaki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The OpenAIRE Graph is an Open Access dataset containing metadata about research products (literature, datasets, software, etc.) linked to other entities of the research ecosystem like organisations, project grants, and data sources.

    The large size of the OpenAIRE Graph is a major impediment for beginners to familiarise themselves with the underlying data model and explore its contents. Working with the Graph at its full size typically requires access to a huge distributed computing infrastructure that is not easily accessible to everyone.

    The OpenAIRE Beginner’s Kit aims to address this issue. It consists of two components:

    • A subset of the OpenAIRE Graph composed of the research products published between 2022-12-28 and 2023-07-31, all the entities connected to them and the respective relationships. The subset is composed of the following parts:

      • publication.tar: metadata records about research literature (includes types of publications listed here)

      • dataset.tar: metadata records about research data (includes the subtypes listed here)

      • software.tar: metadata records about research software (includes the subtypes listed here)

      • otherresearchproduct.tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)

      • organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.

      • datasource.tar: metadata records about data sources whose content is available in the OpenAIRE Graph. They include institutional and thematic repositories, journals, aggregators, funders' databases.

      • project.tar: metadata records about project grants.

      • relation.tar: metadata records about relations between entities in the graph.

      • communities_infrastructures.tar: metadata records about research communities and research infrastructures

        Each file is a tar archive containing gz files, each with one JSON record per line. Each record is compliant with the schema available at http://doi.org/10.5281/zenodo.8238874. A minimal reading sketch follows this list.

    • The code to analyse the data, which is available on GitHub. Just download the archive, unzip/untar it, and follow the instructions in the README file (no need to clone the GitHub repository).
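
    A minimal reading sketch for the layout described above, using publication.tar as an example and peeking at the first record of the first member:

    import gzip
    import json
    import tarfile

    with tarfile.open("publication.tar") as tar:
        for member in tar.getmembers():
            if not member.name.endswith(".gz"):
                continue
            # Each gz member holds one JSON record per line (JSON Lines).
            with gzip.open(tar.extractfile(member), "rt") as lines:
                record = json.loads(next(lines))
                print(record.get("id"))  # field name is an assumption
            break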


  12. Dataset metadata of known Dataverse installations

    • search.dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 22, 2023
    + more versions
    Cite
    Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gautier, Julian
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    β”œβ”€β”€ csv_files_with_metadata_from_most_known_dataverse_installations
    β”‚   β”œβ”€β”€ author(citation).csv
    β”‚   β”œβ”€β”€ basic.csv
    β”‚   β”œβ”€β”€ contributor(citation).csv
    β”‚   β”œβ”€β”€ ...
    β”‚   └── topic_classification(citation).csv
    β”œβ”€β”€ dataverse_json_metadata_from_each_known_dataverse_installation
    β”‚   β”œβ”€β”€ Abacus_2022.10.02_17.11.19.zip
    β”‚   β”‚   β”œβ”€β”€ dataset_pids_Abacus_2022.10.02_17.11.19.csv
    β”‚   β”‚   β”œβ”€β”€ Dataverse_JSON_metadata_2022.10.02_17.11.19
    β”‚   β”‚   β”‚   β”œβ”€β”€ hdl_11272.1_AB2_0AQZNT_v1.0.json
    β”‚   β”‚   β”‚   └── ...
    β”‚   β”‚   └── metadatablocks_v5.6
    β”‚   β”‚       β”œβ”€β”€ astrophysics_v5.6.json
    β”‚   β”‚       β”œβ”€β”€ biomedical_v5.6.json
    β”‚   β”‚       β”œβ”€β”€ citation_v5.6.json
    β”‚   β”‚       β”œβ”€β”€ ...
    β”‚   β”‚       └── socialscience_v5.6.json
    β”‚   β”œβ”€β”€ ACSS_Dataverse_2022.10.02_17.26.19.zip
    β”‚   β”œβ”€β”€ ADA_Dataverse_2022.10.02_17.26.57.zip
    β”‚   β”œβ”€β”€ Arca_Dados_2022.10.02_17.44.35.zip
    β”‚   β”œβ”€β”€ ...
    β”‚   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    β”œβ”€β”€ dataset_pids_from_most_known_dataverse_installations.csv
    β”œβ”€β”€ licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions; the JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.

  13. Data from: GHTraffic: A Dataset for Reproducible Research in Service-Oriented Computing

    • zenodo.org
    zip
    Updated Aug 29, 2020
    Cite
    Thilini Bhagya; Jens Dietrich; Hans Guesgen (2020). GHTraffic: A Dataset for Reproducible Research in Service-Oriented Computing [Dataset]. http://doi.org/10.5281/zenodo.3748921
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 29, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Thilini Bhagya; Jens Dietrich; Hans Guesgen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the latest version of the GHTraffic project. The main aim is to model a variety of transaction sequences to reflect more complex service behaviour.

    This version consists of a single edition collected from the google/guava repository.

    The entire data generation process is quite similar to the original GHTraffic design, but it incorporates minor changes to the synthetic data generation process: it uses a random date after successfully posting a resource to make up the request and response for all of the HTTP methods, and it adds another subset of unsuccessful transactions by issuing requests before resource creation has succeeded.

    This results in a far more dynamic series of transactions to named resources.

    Scripts used for dataset construction are accessible from the repository.

  14. MH-1M Dataset

    • figshare.com
    zip
    Updated Feb 21, 2025
    Cite
    Hendrio BraganΓ§a; Vanderson Rocha; Joner Assolin; Diego Kreutz; Eduardo Feitosa (2025). MH-1M Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28355897.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    figshare
    Authors
    Hendrio BraganΓ§a; Vanderson Rocha; Joner Assolin; Diego Kreutz; Eduardo Feitosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The rapid and widespread increase of Android malware presents substantial obstacles to cybersecurity research. In order to revolutionize the field of malware research, we present the MH-1M dataset, a thorough compilation of 1,340,515 APK samples. This dataset encompasses a wide range of diverse attributes and metadata, offering a comprehensive perspective. The utilization of the VirusTotal API guarantees precise assessment of threats by amalgamating various detection techniques. Our research indicates that MH-1M is a highly current dataset that provides valuable insights into the changing nature of malware.

    MH-1M consists of 23,247 features that cover a wide range of application behavior, from intents::accept to apicalls::landroid/window/splashscreenview.remove. The features are categorized into four primary classifications:

    Feature Types    Values
    APICalls         22,394
    Intents          407
    OPCodes          232
    Permissions      214

    The dataset is stored efficiently, utilizing a memory capacity of 29.0 GB, which showcases its substantial yet controllable magnitude. The dataset consists of 1,221,421 benign applications and 119,094 malware applications, ensuring a balanced representation for accurate malware detection and analysis.

    The MH-1M repository also offers a wide variety of metadata from APKs, providing useful data about the development of malicious software over a period of more than ten years. The Android features include a wide variety of metadata, including SHA256 hashes, file names, package names, compilation APIs, and various other details. This GitHub repository contains over 400GB of valuable data, making it the largest and most comprehensive dataset available for advancing research and development in Android malware detection.

  15. Social Media Profile Links by Name

    • openwebninja.com
    json
    Updated Feb 2, 2025
    Cite
    OpenWeb Ninja (2025). Social Media Profile Links by Name [Dataset]. https://www.openwebninja.com/api/social-links-search
    Explore at:
    Available download formats: json
    Dataset updated
    Feb 2, 2025
    Dataset authored and provided by
    OpenWeb Ninja
    Area covered
    Worldwide
    Description

    This dataset provides comprehensive social media profile links discovered through real-time web search. It includes profiles from major social networks like Facebook, TikTok, Instagram, Twitter, LinkedIn, Youtube, Pinterest, Github and more. The data is gathered through intelligent search algorithms and pattern matching. Users can leverage this dataset for social media research, influencer discovery, social presence analysis, and social media marketing. The API enables efficient discovery of social profiles across multiple platforms. The dataset is delivered in a JSON format via REST API.

  16. 88.6 Million Developer Comments from GitHub

    • data.niaid.nih.gov
    Updated Jan 4, 2024
    + more versions
    Cite
    Benjamin S. Meyers; Andrew Meneely (2024). 88.6 Million Developer Comments from GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5596536
    Explore at:
    Dataset updated
    Jan 4, 2024
    Dataset provided by
    Rochester Institute of Technology
    Authors
    Benjamin S. Meyers; Andrew Meneely
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    This is a collection of developer comments from GitHub issues, commits, and pull requests. We collected 88,640,237 developer comments from 17,378 repositories. In total, this dataset includes:

    54,252,380 issue comments (from 13,458,208 issues)

    979,642 commit comments (from 49,710,108 commits)

    33,408,215 pull request comments (from 12,680,373 pull requests)

    Warning: The uploaded dataset is compressed from 185GB down to 25.1GB.

    Purpose

    The purpose of this dataset (corpus) is to provide a large dataset of software developer comments (natural language) for research. We intend to use this data in our own research, but we hope it will be helpful for other researchers.

    Collection Process

    Full implementation details can be found in the following publication:

    Benjamin S. Meyers. Human Error Assessment in Software Engineering. Rochester Institute of Technology. 2023.

    Data was downloaded using GitHub's GraphQL API via requests made with Python's requests library. We targeted 17,491 repositories with the following criteria:

    At least 850 stars.

    Primary language in the Top 50 from the TIOBE Index and/or listed as "popular" in GitHub's advanced search. Note that we collected the list of languages on August 31, 2021.

    Due to design decisions made by GitHub, we could only get a list of at most 1,000 repositories for each target language. Comments from 113 repositories could not be downloaded for various reasons (failing API queries, JSONDecodeErrors, etc.). Eight target languages had no repositories matching the above criteria.

    After collection using the GraphQL API, data was written to CSV using Python's csv.writer class. We highly recommend using Python's csv.reader to parse these CSV files as no newlines have been removed from developer comments.
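
    For example, a short sketch along those lines; the file name follows the naming scheme described below:

    import csv

    # newline="" lets csv.reader handle newlines embedded in quoted comments.
    with open("Python_is.csv", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        print(header)
        for row in reader:
            pass  # each row is one issue-comment record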

    88_million_developer_comments.zip

    This zip file contains 135 CSV files; 3 per language. CSV names are formatted <language>_<type>.csv, with <language> being the name of the primary language and <type> being one of co (commits), is (issues), or pr (pull requests).

    Languages included are: ABAP, Assembly, C, C# (C-Sharp), C++ (C-PlusPlus), Clojure, COBOL, CoffeeScript, CSS, Dart, D, DM, Elixir, Fortran, F# (F-Sharp), Go, Groovy, HTML, Java, JavaScript, Julia, Kotlin, Lisp, Lua, MATLAB, Nim, Objective-C, Pascal, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Rust, Scala, Scheme, Scratch, Shell, Swift, TSQL, TypeScript, VBScript, and VHDL.

    Details on the columns in each CSV file are described in the provided README.md.

    Detailed_Breakdown.ods

    This spreadsheet contains specific details on how many repositories, commits, issues, pull requests, and comments are included in 88_million_developer_comments.zip.

    Note On Completeness

    We make no guarantee that every commit, issue, and/or pull request for each repository is included in this dataset. Due to the nature of the GraphQL API and data decoding difficulties, sometimes a query failed and that data is not included here.

    Versioning

    v1.1: The original corpus had duplicate header rows in the CSV files. This has been fixed.

    v1.0: Original corpus.

    Contact

    Please contact Benjamin S. Meyers (email) with questions about this data and its collection.

    Acknowledgments

    Collection of this data has been sponsored in part by the National Science Foundation grant 1922169, and by a Department of Defense DARPA SBIR program (grant 140D63-19-C-0018).

    This data was collected using the compute resources from the Research Computing department at the Rochester Institute of Technology. doi:10.34788/0S3G-QD15

  17. Breaking Bad? Semantic Versioning and Impact of Breaking Changes in Maven Central

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 20, 2021
    Cite
    Lina Ochoa; Thomas Degueule; Jean-RΓ©my Falleri; Jurgen Vinju (2021). Breaking Bad? Semantic Versioning and Impact of Breaking Changes in Maven Central (Dataset) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5221839
    Explore at:
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Univ. Bordeaux, Bordeaux INP, CNRS, LaBRI
    Eindhoven University of Technology
    Centrum Wiskunde & Informatica
    Univ. Bordeaux, Bordeaux INP, CNRS, LaBRI, Institut Universitaire de France
    Authors
    Lina Ochoa; Thomas Degueule; Jean-RΓ©my Falleri; Jurgen Vinju
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The content presented in this repository accompanies the paper "Breaking Bad? Semantic Versioning and Impact of Breaking Changes in Maven Central" authored by Lina Ochoa, Thomas Degueule, Jean-RΓ©my Falleri, and Jurgen Vinju. The paper was submitted and accepted in the Journal of Empirical Software Engineering (EMSE'21). This study is an external and differentiated replication study of the paper "Semantic Versioning and Impact of Breaking Changes in the Maven Repository" presented by Steven Raemaekers, Arie van Deursen, and Joost Visser.

    Content

    README.md: document with the main description to start exploring the bundle.

    data.zip: contains the datasets used within the study. These datasets must be used to obtain the same results as the ones presented in the article.

    maven-api-dataset.zip: contains the code used to generate the datasets and to analyse the obtained results. Check the README.md file within this bundle for more information.

    Relevant Links

    maven-api-dataset repository: https://github.com/tdegueul/maven-api-dataset

    maracas repository: https://github.com/crossminer/maracas

    Companion webpage: https://crossminer.github.io/maracas/2021/08/16/emse21/

  18. Data from: AOL Dataset for Browsing History and Topics of Interest

    • data.niaid.nih.gov
    Updated Jun 24, 2024
    + more versions
    Cite
    Nunes, Gabriel Henrique (2024). AOL Dataset for Browsing History and Topics of Interest [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11029571
    Explore at:
    Dataset updated
    Jun 24, 2024
    Dataset authored and provided by
    Nunes, Gabriel Henrique
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AOL Dataset for Browsing History and Topics of Interest

    This record provides the datasets of the paper The Privacy-Utility Trade-off in the Topics API.

    The dataset-generating code and the experimental results can be found at 10.5281/zenodo.11032231 (github.com/nunesgh/topics-api-analysis).

    License

    Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

  19. League of Legends Master+ Players

    • kaggle.com
    zip
    Updated Sep 22, 2021
    Cite
    Ignacio Guillermo Martinez (2021). League of Legends Master+ Players [Dataset]. https://www.kaggle.com/jasperan/league-of-legends-master-players
    Explore at:
    Available download formats: zip (11694163 bytes)
    Dataset updated
    Sep 22, 2021
    Authors
    Ignacio Guillermo Martinez
    Description

    GitHub repository

    Click Here

    Why?

    I am writing articles on League of Legends and Machine Learning. You can find the full repository where this information is stored here.

  20. FiveThirtyEight Bob Ross Dataset

    • kaggle.com
    Updated Apr 26, 2019
    Cite
    FiveThirtyEight (2019). FiveThirtyEight Bob Ross Dataset [Dataset]. https://www.kaggle.com/datasets/fivethirtyeight/fivethirtyeight-bob-ross-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 26, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    FiveThirtyEight
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Content

    Bob Ross

    This folder contains data behind the story A Statistical Analysis of the Work of Bob Ross.

    Context

    This is a dataset from FiveThirtyEight hosted on their GitHub. Explore FiveThirtyEight data using Kaggle and all of the data sources available through the FiveThirtyEight organization page!

    • Update Frequency: This dataset is updated daily.

    Acknowledgements

    This dataset is maintained using GitHub's API and Kaggle's API.

    This dataset is distributed under the Attribution 4.0 International (CC BY 4.0) license.

    Cover photo by Alex Kotomanov on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Natarajan Chidambaram; Natarajan Chidambaram; Alexandre Decan; Alexandre Decan; Tom Mens; Tom Mens (2024). A Dataset of Bot and Human Activities in GitHub [Dataset]. http://doi.org/10.5281/zenodo.8219470
Organization logo

Data from: A Dataset of Bot and Human Activities in GitHub


List of fields

Not only does the dataset contain a list of activities made by bot and human contributors, but it also contains details about these activities. For example, Commenting issue activities provide details about the author of the comment, the repository and issue in which the comment was created, and so on.

For all activity types, we provide the date of the activity, the contributor that made the activity, and the repository in which the activity took place. Depending on the activity type, additional fields are provided. In this section, we describe, for each activity type, the fields that are provided in the JSON file. Note that we also provide the corresponding JSON schema alongside the datasets, as illustrated by the validation sketch below.
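
Since the schema ships with the archive, the data can be checked before any analysis. Below is a minimal Python sketch of such a validation step, assuming the jsonschema package is installed, that the files sit in the working directory, and that JsonSchema.json validates a complete activity file; the filenames are those listed in the files description.

import json
from jsonschema import validate, ValidationError

# Load the JSON schema shipped with the archive.
with open("JsonSchema.json") as f:
    schema = json.load(f)

# Load one of the activity files.
with open("bot_activities.json") as f:
    activities = json.load(f)

try:
    # Assumption: the schema describes a whole activity file; if it
    # instead describes a single activity, validate each element.
    validate(instance=activities, schema=schema)
    print(f"{len(activities)} activities conform to the schema.")
except ValidationError as err:
    print(f"Validation failed: {err.message}")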

Properties

  • date
    • Date on which the activity is performed
    • Type: string
    • e.g., "2022-11-25T09:55:19+00:00"
    • String format must be a "date-time"

  • activity
    • The activity performed by the contributor
    • Type: string
    • e.g., "Commenting pull request"
  • contributor
    • The login name of the contributor who performed this activity
    • Type: string
    • e.g., "analysis-bot", "anonymised" in the case of a human contributor
  • repository
    • The repository in which the activity is performed
    • Type: string
    • e.g., "apache/spark", "anonymised" in the case of a human contributor
  • issue
    • Issue information - provided for Opening issue, Closing issue, Reopening issue, Transferring issue and Commenting issue
    • Type: object
    • Properties
      • id
        • Issue number
        • Type: integer
        • e.g., 35471
      • title
        • Issue title
        • Type: string
        • e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
      • created_at
        • The date on which this issue is created
        • Type: string
        • e.g., "2022-11-10T13:07:23+00:00"
        • String format must be a "date-time"
      • status
        • Current state of the issue
        • Type: string
        • "open" or "closed"
      • closed_at
        • The date on which this issue is closed. "null" will be provided if the issue is open
        • Types: string, null
        • e.g., "2022-11-25T10:42:39+00:00"
        • String format must be a "date-time"
      • resolved
        • Whether the issue is resolved (true) or not_planned/still open (false)
        • Type: boolean
        • true or false
      • GH_node
        • The GitHub node of this issue
        • Type: string
        • e.g., "IC_kwDOC27xRM5PHTBU", "anonymised" in the case of a human contributor
  • pull_request
    • Pull request information - provided for Opening pull request, Closing pull request, Reopening pull request, Commenting pull request changes and Reviewing code
    • Type: object
    • Properties
      • id
        • Pull request number
        • Type: integer
        • e.g., 35471
      • title
        • Pull request title
        • Type: string
        • e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
      • created_at
        • The date on which this pull request is created
        • Type: string
        • e.g., "2022-11-10T13:07:23+00:00"
        • String format must be a "date-time"
      • status
        • Current state of the pull request
        • Type: string
        • "open" or "closed"
      • closed_at
        • The date on which this pull request is closed. "null" will be provided if the pull request is open
        • Types: string, null
        • e.g., "2022-11-25T10:42:39+00:00"
        • String format must be a "date-time"
      • merged
        • Whether the pull request is merged (true) or rejected/still open (false)
        • Type: boolean
        • true or false
      • GH_node
        • The GitHub node of this pull request
        • Type: string
        • e.g., "PR_kwDOC7Q2kM5Dsu3-", "anonymised" in the case of a human contributor
  • review
    • Pull request review information - provided for Reviewing code
    • Type: object
    • Properties
      • status
        • Status of the review
        • Type: string
        • "changes_requested" or "approved" or "dismissed"
      • GH_node
        • The GitHub node of this review
        • Type: string
        • e.g., "PRR_kwDOEBHXU85HLfIn", "anonymised" in the case of a human contributor
  • conversation
    • Comment information for the issue or pull request - provided for Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request and Commenting pull request
    • Type: object
    • Properties
      • comments
        • Number of comments present in the corresponding issue or pull request
        • Type: integer
        • e.g., 19
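
To illustrate how these fields support the analysis of activity sequences and activity patterns, the following minimal Python sketch loads an activity file, counts the high-level activity types, and reconstructs each contributor's chronologically ordered activity sequence. It is an illustrative sketch only, assuming each JSON file contains a top-level list of activity objects; it relies solely on the date, activity and contributor fields, which are present for every activity type.

import json
from collections import Counter, defaultdict
from datetime import datetime

# Load one of the activity files (bot_activities.json or human_activities.json).
with open("bot_activities.json") as f:
    activities = json.load(f)

# Count how often each of the 24 high-level activity types occurs.
type_counts = Counter(a["activity"] for a in activities)
print(type_counts.most_common(5))

# Group activities per contributor and order them chronologically;
# the "date" field is an ISO 8601 date-time string.
sequences = defaultdict(list)
for a in activities:
    sequences[a["contributor"]].append(
        (datetime.fromisoformat(a["date"]), a["activity"])
    )
for seq in sequences.values():
    seq.sort()

# Example: print the first three activities of one contributor.
contributor = next(iter(sequences))
print(contributor, [name for _, name in sequences[contributor][:3]])

Sorting on the parsed date rather than on the raw string keeps the ordering correct even if offsets other than +00:00 were to appear.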
