100+ datasets found
  1. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely manage the analysis of large BigQuery datasets.
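
    A minimal sketch of what such a query might look like with the google-cloud-bigquery client. The `licenses` table and its `license` column are assumptions chosen for illustration (the description only gives the [TABLENAME] pattern), as are the helper names and the 1 GB cap. The dry run estimates how many bytes a query would scan before anything is billed.

```python
# Sketch only: the BigQuery calls assume google-cloud-bigquery is installed
# and that credentials are already configured in the environment.
DATASET = "bigquery-public-data.github_repos"  # from the description above


def build_license_query(table: str = "licenses", limit: int = 10) -> str:
    """Build a small aggregate query; `licenses`/`license` are assumed names."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "SELECT license, COUNT(*) AS repo_count\n"
        f"FROM `{DATASET}.{table}`\n"
        "GROUP BY license\n"
        "ORDER BY repo_count DESC\n"
        f"LIMIT {limit}"
    )


def estimate_bytes(sql: str) -> int:
    """Dry-run the query and return the number of bytes it would scan."""
    from google.cloud import bigquery  # lazy import so the builder works offline

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(sql, job_config=config).total_bytes_processed


def run_capped(sql: str, max_bytes: int = 10**9) -> list:
    """Run the query, failing rather than billing more than `max_bytes`."""
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    return list(client.query(sql, job_config=config).result())


if __name__ == "__main__":
    print(build_license_query())
```

    Checking `estimate_bytes(sql)` before calling `run_capped(sql)` is one way to avoid accidentally scanning a multi-terabyte table.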

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  2. github-code-clean

    • huggingface.co
    • opendatalab.com
    Updated Aug 1, 2022
    Cite
    CodeParrot (2022). github-code-clean [Dataset]. https://huggingface.co/datasets/codeparrot/github-code-clean
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 1, 2022
    Dataset authored and provided by
    CodeParrot
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The GitHub Code clean dataset is a more filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1TB of text data.

  3. GitHub Public Repository Metadata

    • kaggle.com
    zip
    Updated Oct 26, 2025
    Cite
    Peter (2025). GitHub Public Repository Metadata [Dataset]. https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars
    Explore at:
    Available download formats: zip (606866859 bytes)
    Dataset updated
    Oct 26, 2025
    Authors
    Peter
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is obtained from the Github API and contains only public repository-level metadata. It may be useful for anyone interested in studying the Github ecosystem. It contains approximately 3.1 million entries.

    The Github API Terms of Service apply.

    You may not use this dataset for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.

    Please see the sample exploration notebook for some examples of what you can do! The data format is a JSON array of entries, an example of which is given below.

    Example entry

    {
     "owner": "pelmers",
     "name": "text-rewriter",
     "stars": 13,
     "forks": 5,
     "watchers": 4,
     "isFork": false,
     "isArchived": false,
     "languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
     "languageCount": 3,
     "topics": [ { "name": "chrome-extension", "stars": 43211 } ],
     "topicCount": 1,
     "diskUsageKb": 75,
     "pullRequests": 4,
     "issues": 12,
     "description": "Webextension to rewrite phrases in pages",
     "primaryLanguage": "JavaScript",
     "createdAt": "2015-03-14T22:35:11Z",
     "pushedAt": "2022-02-11T14:26:00Z",
     "defaultBranchCommitCount": 54,
     "license": null,
     "assignableUserCount": 1,
     "codeOfConduct": null,
     "forkingAllowed": true,
     "nameWithOwner": "pelmers/text-rewriter",
     "parent": null
    }
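
    To illustrate the format, here is a small sketch that parses a JSON array shaped like the example entry and computes each language's share of a repository's code. The field names come from the example above; the helper name and the trimmed inline sample are illustrative assumptions, not part of the dataset's own tooling.

```python
import json

# A trimmed copy of the example entry, wrapped in an array like the dataset file.
SAMPLE = """[
  {
    "nameWithOwner": "pelmers/text-rewriter",
    "stars": 13,
    "primaryLanguage": "JavaScript",
    "languages": [
      {"name": "JavaScript", "size": 21769},
      {"name": "HTML", "size": 2096},
      {"name": "CSS", "size": 2081}
    ]
  }
]"""


def language_shares(entry: dict) -> dict:
    """Return each language's fraction of the repository's total code size."""
    languages = entry.get("languages") or []
    total = sum(lang["size"] for lang in languages)
    if total == 0:
        return {}
    return {lang["name"]: lang["size"] / total for lang in languages}


entries = json.loads(SAMPLE)
for entry in entries:
    shares = language_shares(entry)
    top = max(shares, key=shares.get)
    print(f'{entry["nameWithOwner"]}: {top} {shares[top]:.1%}')
```

    For the full file, which is several hundred megabytes compressed, an incremental JSON parser may be a better fit than loading everything with `json.loads`.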
    

    The collection script and exploration notebook are also available on Github: https://github.com/pelmers/github-repository-metadata. For more background info, you can read my blog post.

  4. github-code-2025

    • huggingface.co
    Cite
    NIck Saga, github-code-2025 [Dataset]. https://huggingface.co/datasets/nick007x/github-code-2025
    Explore at:
    Authors
    NIck Saga
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🚀 GitHub Code 2025: The Clean Code Manifesto

    A meticulously curated dataset of 1.5M+ repositories representing both quality and innovation in 2025's code ecosystem

      🌟 The Philosophy
    

    Quality Over Quantity, Purpose Over Volume. In an era of data abundance, we present a dataset built on radical curation. Every file, every repository, every byte has been carefully selected to represent the signal in the noise of open-source development.

      🎯 What This Dataset Is… See the full description on the dataset page: https://huggingface.co/datasets/nick007x/github-code-2025.
    
  5. code-parrot-github-code

    • huggingface.co
    Updated Mar 17, 2022
    + more versions
    Cite
    Macrocosmos (2022). code-parrot-github-code [Dataset]. https://huggingface.co/datasets/macrocosm-os/code-parrot-github-code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 17, 2022
    Dataset authored and provided by
    Macrocosmos
    License

    https://choosealicense.com/licenses/other/

    Description

    GitHub Code Dataset

      Dataset Description
    

    The GitHub Code dataset consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling 1TB of data. The dataset was created from the public GitHub dataset on Google BigQuery.

      How to use it
    

    The GitHub Code dataset is a very large dataset so for most use cases it is recommended to make use of the streaming API of datasets. You can load and iterate through the dataset with the following… See the full description on the dataset page: https://huggingface.co/datasets/macrocosm-os/code-parrot-github-code.
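
    The description is truncated at this point, so the following is only a hedged sketch of streaming access with the Hugging Face `datasets` library, not the dataset card's own snippet. The `language` and `path` field names, the `train` split, and the helper names are assumptions; depending on the installed `datasets` version, loading may need extra arguments.

```python
from itertools import islice
from typing import Iterable, Iterator

REPO_ID = "macrocosm-os/code-parrot-github-code"  # from the listing above


def stream_records(repo_id: str = REPO_ID) -> Iterable[dict]:
    """Open the dataset lazily so data is fetched only as it is iterated."""
    from datasets import load_dataset  # lazy import: needs `pip install datasets`

    return load_dataset(repo_id, split="train", streaming=True)


def take_language(records: Iterable[dict], language: str, limit: int) -> Iterator[dict]:
    """Yield at most `limit` records whose (assumed) `language` field matches."""
    matches = (r for r in records if r.get("language") == language)
    return islice(matches, limit)


if __name__ == "__main__":
    # Offline demonstration on stand-in records; swap in stream_records() for real use.
    fake = [
        {"language": "Python", "path": "a.py"},
        {"language": "Go", "path": "b.go"},
        {"language": "Python", "path": "c.py"},
    ]
    for record in take_language(fake, "Python", limit=1):
        print(record["path"])
```

    Because `take_language` only consumes as many records as it needs, it can be applied to a streamed dataset without downloading the full 1TB.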

  6. GitHub Programming Languages Data

    • kaggle.com
    zip
    Updated Jan 2, 2022
    Cite
    Isaac Wen (2022). GitHub Programming Languages Data [Dataset]. https://www.kaggle.com/datasets/isaacwen/github-programming-languages-data
    Explore at:
    Available download formats: zip (41198 bytes)
    Dataset updated
    Jan 2, 2022
    Authors
    Isaac Wen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Context

    A common question among both newcomers to and veterans of computer science and software engineering is which programming language is the best and/or most popular. It is very difficult to give a definitive answer, as there is a seemingly unlimited number of metrics that can define the 'best' or 'most popular' programming language.

    One such metric that can be used to define a 'popular' programming language is the number of projects and files that are made using that programming language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages that are used for repositories, PRs, and issues on GitHub can be a good indicator of the popularity of a language.

    Content

    This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.

    Source

    This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.

    Limitations

    Only public GitHub repositories, and their corresponding PRs/issues, have their data available publicly. Thus, this dataset is based only on public repositories, which may not be fully representative of all repositories on GitHub.

  7. Data extracted from GitHub repositories (training and test data-sets)

    • data.mendeley.com
    Updated Aug 1, 2019
    + more versions
    Cite
    Youcef Bouziane (2019). Data extracted from GitHub repositories (training and test data-sets) [Dataset]. http://doi.org/10.17632/gt3f4jnbvn.3
    Explore at:
    Dataset updated
    Aug 1, 2019
    Authors
    Youcef Bouziane
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains the SQL tables of the training and test datasets used in our experimentation. These tables contain the preprocessed textual data (in the form of tokens) extracted from each training and test project. Besides the preprocessed textual data, this dataset also contains meta-data about the projects, GitHub topics, and GitHub collections. The GitHub projects are identified by the tuple “Owner” and “Name”. The descriptions of the table fields are attached to their respective data descriptions.

  8. github-repos-metadata-40M

    • huggingface.co
    Cite
    Ibragim, github-repos-metadata-40M [Dataset]. https://huggingface.co/datasets/ibragim-bad/github-repos-metadata-40M
    Explore at:
    Authors
    Ibragim
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📊 Metadata for 40 million GitHub repositories

    A cleaned, analysis-ready dataset with per-repository statistics aggregated from GH Archive events: stars, forks, pull requests, open issues, visibility, language signals, and more. Column names mirror the GH Archive / GitHub API semantics where possible. GitHub repo: https://github.com/ibragim-bad/github-repos-metadata-40M

    Source: GH Archive (public GitHub event stream).

      🚀 Quickstart
    

    from datasets import load_dataset

    ds… See the full description on the dataset page: https://huggingface.co/datasets/ibragim-bad/github-repos-metadata-40M.

  9. GitHub Repository Metadata

    • kaggle.com
    zip
    Updated Apr 16, 2025
    Cite
    IsaacOresanya (2025). GitHub Repository Metadata [Dataset]. https://www.kaggle.com/datasets/isaacoresanya/github-repository-metadata
    Explore at:
    Available download formats: zip (4909486 bytes)
    Dataset updated
    Apr 16, 2025
    Authors
    IsaacOresanya
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset captures the metadata of 14,000+ repositories across GitHub. You’ll find everything from stars and forks to health scores, README previews, and language breakdowns.

    It’s ideal for:
    • Identifying repo trends over time
    • Comparing popular vs. low-engagement projects
    • Exploring what makes a repo “healthy”

    Perfect for learning data cleaning, analysis, and visualization using real-world project metadata.

  10. Amazon review data 2018

    • nijianmo.github.io
    • cseweb.ucsd.edu
    • +1 more
    Cite
    UCSD CSE Research Project, Amazon review data 2018 [Dataset]. https://nijianmo.github.io/amazon/
    Explore at:
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Context

    This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

    • More reviews:
      • The total number of reviews is 233.1 million (142.8 million in 2014).
    • New reviews:
      • Current data includes reviews in the range May 1996 - Oct 2018.
    • Metadata:
      • We have added transaction metadata for each review shown on the review page.
      • Added more detailed metadata of the product landing page.

    Acknowledgements

    If you publish articles based on this dataset, please cite the following paper:

    • Jianmo Ni, Jiacheng Li, Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. EMNLP, 2019.
  11. Data from: A Dataset for GitHub Repository Deduplication

    • data-staging.niaid.nih.gov
    Updated Feb 9, 2020
    Cite
    Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris (2020). A Dataset for GitHub Repository Deduplication [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3653919
    Explore at:
    Dataset updated
    Feb 9, 2020
    Dataset provided by
    University of Tennessee
    Athens University of Economics and Business
    Authors
    Spinellis, Diomidis; Kotti, Zoe; Mockus, Audris
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.

    The dataset is provided as two files identifying GitHub repositories using the login-name/project-name convention. The file deduplicate_names contains 10,649,348 tab-separated records mapping a duplicated source project to a definitive target project.

    The file forks_clones_noise_names is a 50,324,363 member superset of the source projects, containing also projects that were excluded from the mapping as noise.
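
    A minimal sketch of how the mapping file could be used to canonicalize repository names before an analysis. It assumes each record has the form source<TAB>target in login-name/project-name notation, as described above; the function names and the sample rows are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List


def load_mapping(lines: Iterable[str]) -> Dict[str, str]:
    """Read tab-separated `source<TAB>target` records into a dict."""
    mapping = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) >= 2:  # skip blank or malformed lines
            mapping[row[0]] = row[1]
    return mapping


def deduplicate(repos: Iterable[str], mapping: Dict[str, str]) -> List[str]:
    """Replace known duplicates by their target and drop repeats, keeping order."""
    seen = set()
    result = []
    for repo in repos:
        canonical = mapping.get(repo, repo)
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result


# Hypothetical rows standing in for the real deduplicate_names file.
sample = io.StringIO("alice/linux\ttorvalds/linux\nbob/linux-fork\ttorvalds/linux\n")
mapping = load_mapping(sample)
print(deduplicate(["alice/linux", "torvalds/linux", "carol/tool", "bob/linux-fork"], mapping))
```

    With the real file, an open file handle can be passed to `load_mapping` in place of the `StringIO` sample.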

  12. Github Dataset

    • universe.roboflow.com
    zip
    Updated Oct 26, 2021
    Cite
    new-workspace-u4xkn (2021). Github Dataset [Dataset]. https://universe.roboflow.com/new-workspace-u4xkn/github/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 26, 2021
    Dataset authored and provided by
    new-workspace-u4xkn
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Projects Bounding Boxes
    Description

    Github

    ## Overview
    
    Github is a dataset for object detection tasks - it contains Projects annotations for 848 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
    
  13. Data from: GHALogs: Large-Scale Dataset of GitHub Actions Runs

    • zenodo.org
    application/gzip, zip
    Updated Dec 5, 2024
    Cite
    Florent Moriconi; Thomas Durieux; Jean-Rémy Falleri; Raphael Troncy; Aurélien Francillon (2024). GHALogs: Large-Scale Dataset of GitHub Actions Runs [Dataset]. http://doi.org/10.5281/zenodo.10154920
    Explore at:
    Available download formats: application/gzip, zip
    Dataset updated
    Dec 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Florent Moriconi; Thomas Durieux; Jean-Rémy Falleri; Raphael Troncy; Aurélien Francillon
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Time period covered
    Oct 2023
    Description

    In recent years, continuous integration and deployment (CI/CD) has become increasingly popular in both the open-source community and industry. Evaluating CI/CD performance is a critical aspect of software development, as it not only helps minimize execution costs but also ensures faster feedback for developers. Despite its importance, there is limited fine-grained knowledge about the performance of CI/CD processes—knowledge that is essential for identifying bottlenecks and optimization opportunities.
    Moreover, the availability of large-scale, publicly accessible datasets of CI/CD logs remains scarce. The few datasets that do exist are often outdated and lack comprehensive coverage. To address this gap, we introduce a new dataset comprising 116k CI/CD workflows executed using GitHub Actions (GHA) across 25k public code projects spanning 20 different programming languages.
    This dataset includes 513k workflow runs encompassing 2.3 million individual steps. For each workflow run, we provide detailed metadata along with complete run logs. To the best of our knowledge, this is the largest dataset of CI/CD runs that includes full log data. The inclusion of these logs enables more in-depth analysis of CI/CD pipelines, offering insights that cannot be gleaned solely from code repositories.
    We postulate that this dataset will facilitate future CI/CD pipeline behavior research through log-based analysis. Potential applications include performance evaluation (e.g., measuring task execution times) and root cause analysis (e.g., identifying reasons for pipeline failures).

  14. GitHub Social Network

    • kaggle.com
    Updated Jan 12, 2023
    Cite
    Gitanjali Wadhwa (2023). GitHub Social Network [Dataset]. https://www.kaggle.com/datasets/gitanjali1425/github-social-network-graph-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gitanjali Wadhwa
    Description

    Description

    An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on the location, repositories starred, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether the GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.

    Properties

    • Directed: No.
    • Node features: Yes.
    • Edge features: No.
    • Node labels: Yes. Binary-labeled.
    • Temporal: No.
    • Nodes: 37,700
    • Edges: 289,003
    • Density: 0.001
    • Transitivity: 0.013
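
    As a rough check on these properties, the density of a simple undirected graph can be computed from the node and edge counts with the standard formula 2E / (N(N - 1)). This is a sketch under the assumption that the listed density follows that definition; the function name is illustrative.

```python
def undirected_density(nodes: int, edges: int) -> float:
    """Density of a simple undirected graph: edges present / edges possible."""
    if nodes < 2:
        return 0.0
    possible = nodes * (nodes - 1) / 2
    return edges / possible


# Node and edge counts taken from the property list above.
density = undirected_density(37_700, 289_003)
average_degree = 2 * 289_003 / 37_700
print(f"density={density:.6f}, average degree={average_degree:.2f}")
```

    With the counts above this comes to roughly 0.0004, the same order of magnitude as the rounded 0.001 listed, and an average of about 15 mutual-follower links per developer.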

    Possible Tasks

    • Binary node classification
    • Link prediction
    • Community detection
    • Network visualisation
  15. Readme files in 16,000,000 public GitHub repositories (October 2016)

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    application/gzip, bin
    Updated Jan 24, 2020
    Cite
    Markovtsev Vadim (2020). Readme files in 16,000,000 public GitHub repositories (October 2016) [Dataset]. http://doi.org/10.5281/zenodo.285419
    Explore at:
    Available download formats: bin, application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Markovtsev Vadim
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Format

    index.csv.gz - CSV comma separated file with 3 columns:

    The flag is either "s" (readme found) or "r" (readme does not exist on the root directory level). Readme file name may be any from the list:

    "README.md", "readme.md", "Readme.md", "README.MD", "README.txt", "readme.txt", "Readme.txt", "README.TXT", "README", "readme", "Readme", "README.rst", "readme.rst", "Readme.rst", "README.RST"

    100 part-r-00xxx files are in "new" Hadoop API format with the following settings:

    1. inputFormatClass is org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

    2. keyClass is org.apache.hadoop.io.Text - repository name

    3. valueClass is org.apache.hadoop.io.BytesWritable - gzipped readme file

  16. GitHub Issues and Comments

    • figshare.com
    html
    Updated Jun 8, 2022
    Cite
    Alexandra Donisan (2022). GitHub Issues and Comments [Dataset]. http://doi.org/10.6084/m9.figshare.20024303.v1
    Explore at:
    Available download formats: html
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alexandra Donisan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Dataset containing GitHub issues (that are labeled using technical debt keywords) together with their comments. Both issues and comments have their GitHub reactions. The dataset is a MongoDB exported JSON.

  17. CGP on GitHub

    • catalog.data.gov
    Updated Sep 4, 2025
    + more versions
    Cite
    United States Government Publishing Office (2025). CGP on GitHub [Dataset]. https://catalog.data.gov/dataset/cgp-on-github
    Explore at:
    Dataset updated
    Sep 4, 2025
    Dataset provided by
    U.S. Government Publishing Office (http://www.gpo.gov/)
    Description

    The CGP on GitHub is a repository of cataloging/metadata resources extracted from the bibliographic records of the Catalog of U.S. Government Publications (CGP). The CGP is the U.S. Government Publishing Office's (GPO) finding tool for publications of the executive, judicial, and legislative branches, and other entities of the U.S. Federal Government. The CGP records comprise the National Collection of U.S. Government Public Information and contain descriptive and subject information to enable the discovery of these resources. Many CGP records provide PURL (persistent uniform resource locator) links to the online versions of publications. For more information, please visit the CGP help pages.

  18. Websites using Github.com Repository

    • webtechsurvey.com
    csv
    Updated Dec 8, 2021
    + more versions
    Cite
    WebTechSurvey (2021). Websites using Github.com Repository [Dataset]. https://webtechsurvey.com/technology/github.com-repository
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 8, 2021
    Dataset authored and provided by
    WebTechSurvey
    License

    https://webtechsurvey.com/terms

    Time period covered
    2025
    Area covered
    Global
    Description

    A complete list of live websites using the Github.com Repository technology, compiled through global website indexing conducted by WebTechSurvey.

  19. Detailed Epidemiological Data from the COVID-19 Outbreak

    • github.com
    • catalog.midasnetwork.us
    Cite
    Open COVID-19 Data Working Group, Detailed Epidemiological Data from the COVID-19 Outbreak [Dataset]. https://github.com/beoutbreakprepared/nCoV2019
    Explore at:
    Dataset provided by
    Open COVID-19 Data Working Group
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Data and code repository for the Open COVID-19 Data Working Group: a global and multi-organizational initiative that aims to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.

  20. GitHub username

    • bioregistry.io
    Updated Apr 1, 2022
    Cite
    (2022). GitHub username [Dataset]. http://identifiers.org/re3data:r3d100010375
    Explore at:
    Dataset updated
    Apr 1, 2022
    Description

    GitHub is an online host of Git source code repositories.
