100+ datasets found
  1. GitHub Activity Data

    • console.cloud.google.com
    Updated Jun 23, 2022
    + more versions
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:GitHub&inv=1&invt=Ab2YDQ (2022). GitHub Activity Data [Dataset]. https://console.cloud.google.com/marketplace/product/github/github-repos
    Explore at:
    53 scholarly articles cite this dataset (View in Google Scholar)
    Dataset updated
    Jun 23, 2022
    Dataset provided by
    GitHub (https://github.com/)
    Google (http://google.com/)
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008. This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories, including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo free tier: each user receives 1TB of free BigQuery processing every month, which can be used to run queries against this public dataset.
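    A minimal sketch of querying this dataset from Python with the google-cloud-bigquery client. The table and column names (bigquery-public-data.github_repos.licenses with repo_name and license) are assumptions based on the public dataset's documented layout; adjust them to what the BigQuery console shows.

      from google.cloud import bigquery

      # Assumes application-default credentials and a Google Cloud project with the
      # BigQuery API enabled; queries against public datasets count toward the
      # 1TB/month free processing tier mentioned above.
      client = bigquery.Client()

      # Hypothetical example: count repositories per license in the public dataset.
      sql = """
          SELECT license, COUNT(*) AS repo_count
          FROM `bigquery-public-data.github_repos.licenses`
          GROUP BY license
          ORDER BY repo_count DESC
          LIMIT 10
      """
      for row in client.query(sql).result():
          print(f"{row.license}: {row.repo_count}")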

  2. github-code

    • huggingface.co
    Cite
    CodeParrot, github-code [Dataset]. https://huggingface.co/datasets/codeparrot/github-code
    Explore at:
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    License

    https://choosealicense.com/licenses/other/

    Description

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totalling 1TB of text data. The dataset was created from the GitHub dataset on BigQuery.
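    A minimal sketch of loading this dataset with the Hugging Face datasets library. The languages filter and the field names below follow the dataset card; treat them as assumptions and check the card if they have changed.

      from datasets import load_dataset

      # Streaming avoids downloading the full ~1TB dump up front. Recent versions of
      # `datasets` may additionally require trust_remote_code=True, because this
      # dataset uses a loading script.
      ds = load_dataset("codeparrot/github-code", split="train",
                        streaming=True, languages=["Python"])

      sample = next(iter(ds))
      print(sample["repo_name"], sample["path"], len(sample["code"]))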

  3. github-code-clean

    • huggingface.co
    • opendatalab.com
    Updated Aug 1, 2022
    Cite
    CodeParrot (2022). github-code-clean [Dataset]. https://huggingface.co/datasets/codeparrot/github-code-clean
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The GitHub Code Clean dataset is a filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1TB of text data.

  4. github-issues

    • huggingface.co
    • opendatalab.com
    Updated Mar 27, 2025
    + more versions
    Cite
    Lewis Tunstall (2025). github-issues [Dataset]. https://huggingface.co/datasets/lewtun/github-issues
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 27, 2025
    Authors
    Lewis Tunstall
    Description

    Dataset Card for GitHub Issues

      Dataset Summary
    

    GitHub Issues is a dataset consisting of GitHub issues and pull requests associated with the 🤗 Datasets repository. It is intended for educational purposes and can be used for semantic search or multilabel text classification. The contents of each GitHub issue are in English and concern the domain of datasets for NLP, computer vision, and beyond.

      Supported Tasks and Leaderboards
    

    For each of the tasks tagged… See the full description on the dataset page: https://huggingface.co/datasets/lewtun/github-issues.
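    A minimal sketch of pulling the dataset and inspecting its columns before building a semantic-search or multilabel-classification pipeline on top of it; the exact field names are not listed here, so check the printed column names against the dataset card.

      from datasets import load_dataset

      issues = load_dataset("lewtun/github-issues", split="train")
      print(issues)               # row count and column names
      print(issues.column_names)  # title/body/labels-style fields (verify on the card)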

  5. Amazon review data 2018

    • nijianmo.github.io
    • cseweb.ucsd.edu
    • +1more
    Cite
    UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://nijianmo.github.io/amazon/
    Explore at:
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Context

    This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

    • More reviews:

      • The total number of reviews is 233.1 million (142.8 million in 2014).
    • New reviews:

      • Current data includes reviews in the range May 1996 - Oct 2018.
    • Metadata:

      • We have added transaction metadata for each review shown on the review page.
      • Added more detailed metadata of the product landing page.
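    A minimal sketch of reading one of the per-category review files, assuming the gzipped JSON-lines layout described on the dataset page; the file name (Software.json.gz) and field names (asin, overall, reviewText) are illustrative assumptions.

      import gzip
      import json

      # Each line is one review record; fields may differ per download, so use .get().
      with gzip.open("Software.json.gz", "rt", encoding="utf-8") as fh:
          for line in fh:
              review = json.loads(line)
              print(review.get("asin"), review.get("overall"),
                    review.get("reviewText", "")[:80])
              break  # show only the first record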

    Acknowledgements

    If you publish articles based on this dataset, please cite the following paper:

    • Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. EMNLP, 2019.
  6. Project GitHub

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). Project GitHub [Dataset]. https://catalog.data.gov/dataset/project-github
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    The release of CertWare was announced on 23 Mar 2012 on code.nasa.gov. The announcement points to the CertWare project on NASA’s GitHub repository at nasa.github.com/CertWare. The project site contains install instructions (as an Eclipse feature), various tutorials and resources, and a link to the GitHub source repository. CertWare is released under the NASA Open Source Agreement (NOSA).

  7. Public Git Archive Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Nov 27, 2021
    Cite
    (2021). Public Git Archive Dataset [Dataset]. https://paperswithcode.com/dataset/public-git-archive
    Explore at:
    Dataset updated
    Nov 27, 2021
    Description

    The Public Git Archive is a dataset of 182,014 top-bookmarked Git repositories from GitHub totalling 6 TB. The dataset provides the source code of the projects, the related metadata, and development history.

  8. GitHub-Python Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 15, 2021
    Cite
    Michihiro Yasunaga; Percy Liang (2021). GitHub-Python Dataset [Dataset]. https://paperswithcode.com/dataset/github-python
    Explore at:
    Dataset updated
    Jun 15, 2021
    Authors
    Michihiro Yasunaga; Percy Liang
    Description

    The GitHub-Python dataset targets the task of repairing AST parse (syntax) errors in Python code.
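    An illustrative sketch (not part of the dataset itself) showing how an AST parse error, the kind of error this dataset targets, can be detected with Python's standard ast module.

      import ast

      snippet = "def add(a, b)\n    return a + b\n"   # missing colon after the signature
      try:
          ast.parse(snippet)
          print("parses cleanly")
      except SyntaxError as err:
          print(f"syntax error at line {err.lineno}: {err.msg}")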

  9. Readme files in 16,000,000 public GitHub repositories (October 2016)

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin
    Updated Jan 24, 2020
    Cite
    Markovtsev Vadim (2020). Readme files in 16,000,000 public GitHub repositories (October 2016) [Dataset]. http://doi.org/10.5281/zenodo.285419
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Markovtsev Vadim
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Format

    index.csv.gz - CSV comma separated file with 3 columns:

    The flag is either "s" (readme found) or "r" (readme does not exist on the root directory level). Readme file name may be any from the list:

    "README.md", "readme.md", "Readme.md", "README.MD", "README.txt", "readme.txt", "Readme.txt", "README.TXT", "README", "readme", "Readme", "README.rst", "readme.rst", "Readme.rst", "README.RST"

    100 part-r-00xxx files are in "new" Hadoop API format with the following settings:

    1. inputFormatClass is org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

    2. keyClass is org.apache.hadoop.io.Text - repository name

    3. valueClass is org.apache.hadoop.io.BytesWritable - gzipped readme file
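    A hedged sketch of reading the part-r-00xxx files with PySpark, based on the key/value classes described above: keys are repository names (Text) and values are gzipped readme files (BytesWritable). The input path is a hypothetical placeholder.

      import gzip

      from pyspark import SparkContext

      # PySpark converts Text to str and BytesWritable to bytearray automatically.
      sc = SparkContext(appName="github-readmes")
      rdd = sc.sequenceFile("path/to/part-r-*")  # hypothetical local path

      repo_name, blob = rdd.first()
      readme = gzip.decompress(bytes(blob)).decode("utf-8", errors="replace")
      print(repo_name, readme[:200])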

  10. Malware Repositories and Their Authors on GitHub

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 11, 2024
    Cite
    Tania, Nishat Ara (2024). Malware Repositories and Their Authors on GitHub [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10806592
    Explore at:
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    Zhang, Qian
    Faloutsos, Michalis
    Tania, Nishat Ara
    Rokon, Md Omar Faruk
    Masud, Md Rayhanul
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.

    Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.

    Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.

    We are offering two datasets in this paper. First, a list of malware repositories: we have collected and extended the malware repositories on GitHub in 2022, following the original papers. Second, a CSV file with the GitHub users' information and their maliciousness classification label.

    malware_repos.txt

    Purpose: This file contains a curated list of GitHub repositories identified as containing malware. These repositories were identified following the methodology outlined in the research paper "SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub."

    Contents: The file is structured as a simple text file, with each line representing a unique repository in the format username/reponame. This format allows for easy identification and access to each repository on GitHub for further analysis or review.

    Usage: The list serves as a critical resource for researchers and cybersecurity professionals interested in studying malware, understanding its distribution on platforms like GitHub, or developing defense mechanisms against such malicious content.

    obfuscated_github_user_dataset.csv

    Purpose: Accompanying the list of malware repositories, this CSV file contains detailed, albeit obfuscated, profile information of the GitHub users who authored these repositories. The obfuscation process has been applied to protect user privacy and comply with ethical standards, especially given the sensitive nature of associating individuals with potentially malicious activities.

    Contents: The dataset includes several columns representing different aspects of user profiles, such as obfuscated identifiers (e.g., ID, login, name), contact information (e.g., email, blog), and GitHub-specific metrics (e.g., followers count, number of public repositories). Notably, sensitive information has been masked or replaced with generic placeholders to prevent user identification.

    Usage: This dataset can be instrumental for researchers analyzing behaviors, patterns, or characteristics of users involved in creating malware repositories on GitHub. It provides a basis for statistical analysis, trend identification, or the development of predictive models, all while upholding the necessary ethical considerations.
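    A hedged sketch of combining the two files described above: parse the username/reponame lines and join them with the obfuscated user profiles. The join key ("login") is an assumption; check the CSV header for the actual obfuscated identifier column.

      import pandas as pd

      with open("malware_repos.txt", encoding="utf-8") as fh:
          repos = [line.strip().split("/", 1) for line in fh if "/" in line]
      repos_df = pd.DataFrame(repos, columns=["login", "repo"])

      users_df = pd.read_csv("obfuscated_github_user_dataset.csv")
      merged = repos_df.merge(users_df, on="login", how="left")
      print(merged.head())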

  11. Coronavirus (Covid-19) Data in the United States

    • github.com
    • openicpsr.org
    • +2more
    csv
    Cite
    New York Times, Coronavirus (Covid-19) Data in the United States [Dataset]. https://github.com/nytimes/covid-19-data
    Explore at:
    Available download formats: csv
    Dataset provided by
    New York Times
    License

    https://github.com/nytimes/covid-19-data/blob/master/LICENSE

    Description

    The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

    Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

    We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

    The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
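    A hedged sketch of loading the cumulative state-level counts with pandas. The raw file name (us-states.csv) and the columns (date, state, cases, deaths) are assumptions based on the repository layout; adjust them to the files you actually use.

      import pandas as pd

      url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"
      states = pd.read_csv(url, parse_dates=["date"])

      # Counts are cumulative, so the per-state maximum is the latest reported total.
      print(states.groupby("state")["cases"].max().sort_values(ascending=False).head())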

  12. GitHub developer behavior and repository evolution dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 7, 2020
    Cite
    Shengyu Zhao (2020). GitHub developer behavior and repository evolution dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3648083
    Explore at:
    Dataset updated
    Feb 7, 2020
    Dataset provided by
    Tianyi Zhou
    Shengyu Zhao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this work, based on the GitHub Archive project and repository-mining tools, we process all available data into a concise, structured format to generate a dataset of GitHub developer behavior and repository evolution. Together with the self-configurable interactive analysis tool we provide, it offers a macroscopic view of the evolution of the open source ecosystem.

  13. Data from: Data Science Problems

    • github.com
    • opendatalab.com
    Updated Feb 8, 2022
    + more versions
    Cite
    (2022). Data Science Problems [Dataset]. https://github.com/microsoft/DataScienceProblems
    Explore at:
    Dataset updated
    Feb 8, 2022
    License

    https://github.com/microsoft/DataScienceProblems/blob/main/LICENSE.txt

    Description

    Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of notebooks in this benchmark also include data dependencies, so this benchmark not only can test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant (https://arxiv.org/abs/2201.12901) for more details about state of the art results and other properties of the dataset.

  14. (No) Influence of Continuous Integration on the Development Activity in GitHub Projects

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    + more versions
    Cite
    Knack, Jascha (2020). (No) Influence of Continuous Integration on the Development Activity in GitHub Projects — Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1140260
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Knack, Jascha
    Baltes, Sebastian
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.

    We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

    used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),

    were active for at least one year (365 days) before the first build with Travis CI (before_ci),

    used Travis CI at least for one year (during_ci),

    had commit or merge activity on the default branch in both of these phases, and

    used the default branch to trigger builds.

    To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.

    We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).

    We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:

    have Java or Ruby as their project language

    used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent)

    have commit activity for at least two years (730 days)

    are engineered software projects (at least 10 watchers)

    were not in the TravisTorrent dataset

    In total, 8,046 projects satisfied those constraints. We drew a random sample of 800 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample. We then split the development activity at the median development date, removed projects without commits or merges in either of the two resulting time spans, and then manually checked the remaining projects to remove the ones with CI configuration files. The final comparison sample contained 60 non-CI projects.

    This dataset contains the following files:

    tr_projects_sample_filtered_2.csv A CSV file with information about the 113 selected projects.

    tr_sample_commits_default_branch_before_ci.csv tr_sample_commits_default_branch_during_ci.csv One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    • project: GitHub project name ("/" replaced by "_").
    • branch: The branch to which the commit was made.
    • hash_value: The SHA1 hash value of the commit.
    • author_name: The author name.
    • author_email: The author email address.
    • author_date: The authoring timestamp.
    • commit_name: The committer name.
    • commit_email: The committer email address.
    • commit_date: The commit timestamp.
    • log_message_length: The length of the git commit messages (in characters).
    • file_count: Files changed with this commit.
    • lines_added: Lines added to all files changed with this commit.
    • lines_deleted: Lines deleted in all files changed with this commit.
    • file_extensions: Distinct file extensions of files changed with this commit.

    tr_sample_merges_default_branch_before_ci.csv tr_sample_merges_default_branch_during_ci.csv One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

    • project: GitHub project name ("/" replaced by "_").
    • branch: The destination branch of the merge.
    • hash_value: The SHA1 hash value of the merge commit.
    • merged_commits: Unique hash value prefixes of the commits merged with this commit.
    • author_name: The author name.
    • author_email: The author email address.
    • author_date: The authoring timestamp.
    • commit_name: The committer name.
    • commit_email: The committer email address.
    • commit_date: The commit timestamp.
    • log_message_length: The length of the git commit messages (in characters).
    • file_count: Files changed with this commit.
    • lines_added: Lines added to all files changed with this commit.
    • lines_deleted: Lines deleted in all files changed with this commit.
    • file_extensions: Distinct file extensions of files changed with this commit.
    • pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
    • source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
    • source_branch: Source branch of the pull request (extracted from log message).

    comparison_project_sample_800.csv A CSV file with information about the 800 projects in the comparison sample.

    commits_default_branch_before_mid.csv commits_default_branch_after_mid.csv One CSV file with information about all commits to the default branch before and after the medium date of the commit history. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commits tables described above.

    merges_default_branch_before_mid.csv merges_default_branch_after_mid.csv One CSV file with information about all merges into the default branch before and after the medium date of the commit history. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.
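    A minimal sketch of comparing commit activity before and during CI per project, using the columns documented above (project, author_date, lines_added). File names come from the list above; the standard comma separator is assumed.

      import pandas as pd

      before = pd.read_csv("tr_sample_commits_default_branch_before_ci.csv",
                           parse_dates=["author_date"])
      during = pd.read_csv("tr_sample_commits_default_branch_during_ci.csv",
                           parse_dates=["author_date"])

      summary = pd.DataFrame({
          "commits_before": before.groupby("project").size(),
          "commits_during": during.groupby("project").size(),
          "lines_added_before": before.groupby("project")["lines_added"].sum(),
          "lines_added_during": during.groupby("project")["lines_added"].sum(),
      })
      print(summary.head())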

  15. Data from: GHALogs: Large-Scale Dataset of GitHub Actions Runs

    • zenodo.org
    application/gzip, zip
    Updated Dec 5, 2024
    Cite
    Florent Moriconi; Thomas Durieux; Jean-Rémy Falleri; Raphael Troncy; Aurélien Francillon (2024). GHALogs: Large-Scale Dataset of GitHub Actions Runs [Dataset]. http://doi.org/10.5281/zenodo.10154920
    Explore at:
    Available download formats: application/gzip, zip
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Florent Moriconi; Thomas Durieux; Jean-Rémy Falleri; Raphael Troncy; Aurélien Francillon
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Oct 2023
    Description

    In recent years, continuous integration and deployment (CI/CD) has become increasingly popular in both the open-source community and industry. Evaluating CI/CD performance is a critical aspect of software development, as it not only helps minimize execution costs but also ensures faster feedback for developers. Despite its importance, there is limited fine-grained knowledge about the performance of CI/CD processes—knowledge that is essential for identifying bottlenecks and optimization opportunities.
    Moreover, the availability of large-scale, publicly accessible datasets of CI/CD logs remains scarce. The few datasets that do exist are often outdated and lack comprehensive coverage. To address this gap, we introduce a new dataset comprising 116k CI/CD workflows executed using GitHub Actions (GHA) across 25k public code projects spanning 20 different programming languages.
    This dataset includes 513k workflow runs encompassing 2.3 million individual steps. For each workflow run, we provide detailed metadata along with complete run logs. To the best of our knowledge, this is the largest dataset of CI/CD runs that includes full log data. The inclusion of these logs enables more in-depth analysis of CI/CD pipelines, offering insights that cannot be gleaned solely from code repositories.
    We postulate that this dataset will facilitate future CI/CD pipeline behavior research through log-based analysis. Potential applications include performance evaluation (e.g., measuring task execution times) and root cause analysis (e.g., identifying reasons for pipeline failures).

  16. Github Dataset for Authorship Attribution

    • ieee-dataport.org
    Updated Oct 21, 2021
    Cite
    Farzaneh Abazari (2021). Github Dataset for Authorship Attribution [Dataset]. https://ieee-dataport.org/documents/github-dataset-authorship-attribution
    Explore at:
    Dataset updated
    Oct 21, 2021
    Authors
    Farzaneh Abazari
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    919 java source codes written by 3

  17. GitHub Issues and Comments

    • figshare.com
    html
    Updated Jun 8, 2022
    Cite
    Alexandra Donisan (2022). GitHub Issues and Comments [Dataset]. http://doi.org/10.6084/m9.figshare.20024303.v1
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    figshare
    Authors
    Alexandra Donisan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset containing GitHub issues (labeled using technical debt keywords) together with their comments. Both issues and comments include their GitHub reactions. The dataset is a MongoDB-exported JSON file.
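    A hedged sketch of loading a MongoDB-exported JSON file in Python: mongoexport output is usually one JSON document per line, but some exports are a single JSON array, so both cases are handled. The file name is a hypothetical placeholder.

      import json

      def load_mongo_export(path):
          with open(path, encoding="utf-8") as fh:
              first_char = fh.read(1)
              fh.seek(0)
              if first_char == "[":  # array-style export
                  return json.load(fh)
              return [json.loads(line) for line in fh if line.strip()]  # JSON lines

      issues = load_mongo_export("github_issues.json")
      print(len(issues), "documents")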

  18. GitHub Fields

    • windsor.ai
    json
    Updated Jun 5, 2024
    Cite
    Windsor.ai (2024). GitHub Fields [Dataset]. https://windsor.ai/data-field/github/
    Explore at:
    jsonAvailable download formats
    Dataset updated
    Jun 5, 2024
    Dataset provided by
    Windsor.ai
    Variables measured
    Today, Source, teams.id, users.id, events.id, issues.id, tags.name, teams.url, users.url, events.org, and 736 more
    Description

    Auto-generated structured data of GitHub from table Fields

  19. github-code-dataset

    • huggingface.co
    Updated Jun 6, 2024
    + more versions
    Cite
    Wayner Barrios (2024). github-code-dataset [Dataset]. https://huggingface.co/datasets/waybarrios/github-code-dataset
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2024
    Authors
    Wayner Barrios
    Description

    Dataset Card for "github-code-dataset"

    More Information needed

  20. Data from: A Panel Data Set of Cryptocurrency Development Activity on GitHub

    • zenodo.org
    application/gzip, bin +2
    Updated Jan 24, 2020
    Cite
    Rijnard van Tonder; Asher Trockman; Claire Le Goues (2020). A Panel Data Set of Cryptocurrency Development Activity on GitHub [Dataset]. http://doi.org/10.5281/zenodo.2595588
    Explore at:
    Available download formats: txt, application/gzip, bin, csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rijnard van Tonder; Asher Trockman; Claire Le Goues
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contents:

    • all-sorted-recovered-normalized-2018-01-21-to-2019-02-04.csv: CSV format of all data, sorted by date. This file contains some imputed values for missing data, with all fields across all repositories normalized to "null". This is the most convenient form to use.
    • all-sorted-2018-01-21-to-2019-02-04.csv: CSV format of all data, sorted by date. It is the raw data after processing the raw format.
    • raw-data-2018-01-21-to-2019-02-04.tar.gz: The raw format of data collected (S-expressions). Contains additional contributor data and CoinMarketCap data not currently in the CSV datasets.
    • recovered.patch: The modification on all-sorted-2018-01-21-to-2019-02-04.csv after recovering (imputing) data, showing what was recovered.
    • recovered-normalized.patch: The modification of all-sorted-2018-01-21-to-2019-02-04.csv after normalizing the recovered data set. Thus, patching all-sorted-2018-01-21-to-2019-02-04.csv with recovered.patch, then recovered-normalized.patch gives all-sorted-recovered-normalized-2018-01-21-to-2019-02-04.csv
    • missing-dates.txt: Days for which we missed GitHub data collection (partial or completely).

    Related publications:

    @inproceedings{van-tonder-crypto-oss-2019, 
     title = {{A Panel Data Set of Cryptocurrency Development Activity on GitHub}},
     booktitle = "International Conference on Mining Software Repositories",
     author = "{van~Tonder}, Rijnard and Trockman, Asher and {Le~Goues}, Claire",
     series = {MSR '19},
     year = 2019
    } 
    
    @inproceedings{trockman-striking-gold-2019, 
     title = {{Striking Gold in Software Repositories? An Econometric Study of Cryptocurrencies on GitHub}},
     booktitle = "International Conference on Mining Software Repositories",
     author = "Trockman, Asher and {van~Tonder}, Rijnard and Vasilescu, Bogdan",
     series = {MSR '19},
     year = 2019
    }

    Related code: https://github.com/rvantonder/CryptOSS
