100+ datasets found
  1. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub: https://github.com/
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely analyze large BigQuery datasets.
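As a minimal sketch of querying this 3TB+ dataset safely, the helper below dry-runs a query to see how many bytes it would scan and only then executes it under a byte cap. It assumes the google-cloud-bigquery package and working credentials; the `licenses` table and its `license` column are assumptions to check against the dataset schema, and the helper names and the 1 GB cap are illustrative.

```python
PROJECT_DATASET = "bigquery-public-data.github_repos"


def table_id(table_name: str) -> str:
    """Build the fully qualified table ID for a table in this dataset."""
    return f"{PROJECT_DATASET}.{table_name}"


def build_license_query(limit: int = 10) -> str:
    """Count repositories per license (assumes `licenses` has a `license` column)."""
    return (
        "SELECT license, COUNT(*) AS repo_count "
        f"FROM `{table_id('licenses')}` "
        "GROUP BY license ORDER BY repo_count DESC "
        f"LIMIT {int(limit)}"
    )


def run_with_byte_cap(sql: str, max_bytes: int = 10**9):
    """Dry-run the query first, then execute it only if it scans <= max_bytes."""
    # Imported here so the helpers above work without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    dry_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_job = client.query(sql, job_config=dry_config)
    if dry_job.total_bytes_processed > max_bytes:
        raise RuntimeError(f"query would scan {dry_job.total_bytes_processed} bytes")
    # maximum_bytes_billed makes BigQuery reject the job if it exceeds the cap.
    capped = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    return list(client.query(sql, job_config=capped).result())


if __name__ == "__main__":
    print(build_license_query())
```

Calling `run_with_byte_cap(build_license_query())` would return the result rows. The dry run itself is not billed, which is why it is a reasonable first step before touching the larger tables.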

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  2. Github Dataset - Data Analysis

    • kaggle.com
    zip
    Updated Feb 24, 2023
    Cite
    Abhishek Ranjan (2023). Github Dataset - Data Analysis [Dataset]. https://www.kaggle.com/datasets/abhishekrp1517/github-dataset-data-analysis
    Explore at:
    zip(422672 bytes)Available download formats
    Dataset updated
    Feb 24, 2023
    Authors
    Abhishek Ranjan
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Dataset

    This dataset was created by Abhishek Ranjan

    Released under Database: Open Database, Contents: Database Contents

    Contents

  3. GitHub Programming Languages Data

    • kaggle.com
    zip
    Updated Jan 2, 2022
    Cite
    Isaac Wen (2022). GitHub Programming Languages Data [Dataset]. https://www.kaggle.com/datasets/isaacwen/github-programming-languages-data
    Explore at:
    zip(41198 bytes)Available download formats
    Dataset updated
    Jan 2, 2022
    Authors
    Isaac Wen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    A common question for those both new to and familiar with computer science and software engineering is which programming language is the best and/or most popular. It is very difficult to give a definitive answer, as there is a seemingly endless number of metrics that can define the 'best' or 'most popular' programming language.

    One metric that can be used to define a 'popular' programming language is the number of projects and files written in that language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages used for repositories, PRs, and issues on GitHub can be a good indicator of a language's popularity.

    Content

    This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.

    Source

    This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.

    Limitations

    Only public GitHub repositories, and their corresponding PRs/issues, have their data publicly available. Thus, this dataset is based only on public repositories, which may not be fully representative of all repositories on GitHub.

  4. GitHub Public Repository Metadata

    • kaggle.com
    zip
    Updated Oct 26, 2025
    Cite
    Peter (2025). GitHub Public Repository Metadata [Dataset]. https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars
    Explore at:
    zip(606866859 bytes)Available download formats
    Dataset updated
    Oct 26, 2025
    Authors
    Peter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is obtained from the Github API and contains only public repository-level metadata. It may be useful for anyone interested in studying the Github ecosystem. It contains approximately 3.1 million entries.

    The Github API Terms of Service apply.

    You may not use this dataset for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.

    Please see the sample exploration notebook for some examples of what you can do! The data format is a JSON array of entries, an example of which is given below.

    Example entry

    {
     "owner": "pelmers",
     "name": "text-rewriter",
     "stars": 13,
     "forks": 5,
     "watchers": 4,
     "isFork": false,
     "isArchived": false,
     "languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
     "languageCount": 3,
     "topics": [ { "name": "chrome-extension", "stars": 43211 } ],
     "topicCount": 1,
     "diskUsageKb": 75,
     "pullRequests": 4,
     "issues": 12,
     "description": "Webextension to rewrite phrases in pages",
     "primaryLanguage": "JavaScript",
     "createdAt": "2015-03-14T22:35:11Z",
     "pushedAt": "2022-02-11T14:26:00Z",
     "defaultBranchCommitCount": 54,
     "license": null,
     "assignableUserCount": 1,
     "codeOfConduct": null,
     "forkingAllowed": true,
     "nameWithOwner": "pelmers/text-rewriter",
     "parent": null
    }
    

    The collection script and exploration notebook are also available on Github: https://github.com/pelmers/github-repository-metadata. For more background info, you can read my blog post.
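A minimal sketch of reading entries in the format shown in the example entry: it parses a JSON array and computes each language's share of a repository's code. The `language_shares` helper is hypothetical, and it assumes `size` is a byte count. The inline array holds only a trimmed copy of the example entry; the real file is far larger and may need a streaming parser instead of a single `json.loads`.

```python
import json

# One-entry array in the documented format (fields trimmed to those used below).
RAW = """[
  {"nameWithOwner": "pelmers/text-rewriter", "stars": 13, "primaryLanguage": "JavaScript",
   "languages": [{"name": "JavaScript", "size": 21769},
                 {"name": "HTML", "size": 2096},
                 {"name": "CSS", "size": 2081}]}
]"""


def language_shares(entry: dict) -> dict:
    """Return each language's fraction of the repository's total language size."""
    total = sum(lang["size"] for lang in entry["languages"])
    if total == 0:
        return {}
    return {lang["name"]: lang["size"] / total for lang in entry["languages"]}


entries = json.loads(RAW)
shares = language_shares(entries[0])
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```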

  5. github-final-datasets

    • kaggle.com
    zip
    Updated Nov 9, 2023
    Cite
    Olga Ivanova (2023). github-final-datasets [Dataset]. https://www.kaggle.com/datasets/olgaiv39/github-final-datasets
    Explore at:
    zip(1877861953 bytes)Available download formats
    Dataset updated
    Nov 9, 2023
    Authors
    Olga Ivanova
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Github Clean Code Snippets Dataset

    This describes how the datasets for a training notebook, used in a Telegram ML Contest solution, were prepared.

    1 Step - Github Samples Database parsing

    The first part of the code samples was taken from a private version of this notebook.

    Statistics on the programming-language classes in the Github Code Snippets database (chart image): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F2fdc091661198e80559f8cb1d1a306ff%2FScreenshot%202023-11-07%20at%2021.24.42.png?generation=1699390166413391&alt=media

    From this database, 2 csv files were created with 50000 code samples for each of the 20 programming languages included, one using equal-count sampling and one using stratified sampling. The corresponding files are sample_equal_prop_50000.csv and sample_stratified_50000.csv, respectively.
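The difference between the two sampling schemes can be sketched in a few lines of standard-library Python. This is an illustration on toy data, not the author's script: the function names, the seed, and the 80/20 class split are all made up.

```python
import random
from collections import Counter, defaultdict


def group_by_label(rows):
    """Group (snippet, language) rows by language."""
    groups = defaultdict(list)
    for snippet, language in rows:
        groups[language].append((snippet, language))
    return groups


def sample_equal(rows, per_class, seed=0):
    """Draw the same number of rows from every language (capped by availability)."""
    rng = random.Random(seed)
    out = []
    for items in group_by_label(rows).values():
        out.extend(rng.sample(items, min(per_class, len(items))))
    return out


def sample_stratified(rows, total, seed=0):
    """Draw rows so each language keeps roughly its share of the full data."""
    rng = random.Random(seed)
    out = []
    for items in group_by_label(rows).values():
        k = round(total * len(items) / len(rows))
        out.extend(rng.sample(items, min(k, len(items))))
    return out


# Toy data: 80 Python rows and 20 Go rows.
rows = [(f"py{i}", "Python") for i in range(80)] + [(f"go{i}", "Go") for i in range(20)]
equal = Counter(lang for _, lang in sample_equal(rows, per_class=10))
strat = Counter(lang for _, lang in sample_stratified(rows, total=20))
print(equal, strat)
```

Equal-count sampling gives every class the same weight during training, while stratified sampling preserves the class imbalance of the source data.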

    2 Step - Github Bigquery Database parsing

    The second option for capturing additional examples was to run this notebook with a larger number of queries, 10000.

    The resulting file is dataset-10000.csv, included in the data card.

    The statistics for the programming languages are shown in this chart, which has 32 labeled classes:
    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F7c04342da8ec1df266cd90daf00204f9%2FScreenshot%202023-10-13%20at%2020.52.13.png?generation=1699392769199533&alt=media

    3 Step - Collection of raw code samples

    To make the model more robust, code samples for 20 additional languages were collected, 10 to 15 samples each, covering more or less popular use cases. Also, for the class "OTHER" (regular natural-language text, as required by the competition task), text examples from this dataset with prompts on Huggingface were added to the file. The resulting file is rare_languages.csv, also in the data card.

    The statistics for the rare-language code snippets are shown in this chart: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2F0b340781c774d2acb988ce1567f4afa3%2FScreenshot%202023-11-08%20at%2001.13.07.png?generation=1699402436798661&alt=media

    4 Step - First and second datasets combining

    For this stage of dataset creation, the columns in sample_equal_prop_50000.csv and sample_stratified_50000.csv were reduced to just two, "snippet" and "language". The equal-count version is in the data card as sample_equal_prop_50000_clean.csv.

    To prepare the Bigquery dataset file, the index column was dropped and the column "content" was renamed to "snippet". These changes were saved in dataset-10000-clean.csv.

    After that, the files sample_equal_prop_50000_clean.csv and dataset-10000-clean.csv were combined and saved as github-combined-file.csv.
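The column trimming, renaming, and combining described in this step can be sketched with the standard csv module. The in-memory files and their extra columns (`id`, `repo`, an unnamed index column) are hypothetical stand-ins; only the "snippet", "language", and "content" names come from the description.

```python
import csv
import io

KEEP = ["snippet", "language"]


def normalize(rows, rename=None):
    """Rename columns per `rename`, then keep only the snippet and language fields."""
    rename = rename or {}
    for row in rows:
        renamed = {rename.get(key, key): value for key, value in row.items()}
        yield {key: renamed[key] for key in KEEP}


# Hypothetical in-memory stand-ins for the two source files.
samples_csv = io.StringIO("id,snippet,language,repo\n1,print(1),Python,a/b\n")
bigquery_csv = io.StringIO(",content,language\n0,fmt.Println(1),Go\n")

combined = list(normalize(csv.DictReader(samples_csv)))
combined += list(normalize(csv.DictReader(bigquery_csv), rename={"content": "snippet"}))

# Write the combined rows with a single shared header.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=KEEP)
writer.writeheader()
writer.writerows(combined)
print(out.getvalue())
```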

    5 Step - Datasets cleaning from symbols and merging together with rare languages

    The prepared files took too much RAM to be read by the Pandas library, so additional preprocessing was done: symbols such as quotes, commas, ampersands, new lines, and tab characters were cleaned out. After cleaning, the files were merged with the rare_languages.csv file and saved as github-combined-file-no-symbols-rare-clean.csv and sample_equal_prop_50000_-no-symbols-rare-clean.csv, respectively.

    The final distribution of classes is shown in this chart: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F833757%2Ff43e0cea4c565c9f7c808527b0dfa2da%2FScreenshot%202023-11-09%20at%2020.26.30.png?generation=1699558064765454&alt=media
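A minimal sketch of the symbol cleaning, assuming the listed characters are replaced by spaces and runs of spaces are then collapsed; the description does not say whether the original script deleted or replaced them, so `clean_snippet` is a hypothetical helper.

```python
import re

# Characters named in the text: quotes, commas, ampersands, newlines, and tabs.
_SYMBOLS = re.compile(r"[\"',&\r\n\t]")
_SPACES = re.compile(r" {2,}")


def clean_snippet(text: str) -> str:
    """Replace the listed symbols with spaces and collapse repeated spaces."""
    return _SPACES.sub(" ", _SYMBOLS.sub(" ", text)).strip()


print(clean_snippet('x = "a", \'b\'\n\tprint(x & 1)'))
```

Removing newlines and tabs keeps every snippet on one CSV line, which may be part of why the cleaned files were easier to load, at the cost of losing indentation as a signal.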

    6 Step - Fixing up the labels

    To suit the TF-DF format, each programming language was also assigned a label. The final labels are in the data card.

  6. Google Universal Embedding Challenge Github Repo

    • kaggle.com
    zip
    Updated Jul 12, 2022
    Cite
    Darien Schettler (2022). Google Universal Embedding Challenge Github Repo [Dataset]. https://www.kaggle.com/datasets/dschettler8845/google-universal-embedding-challenge-github-repo
    Explore at:
    zip(13561 bytes)Available download formats
    Dataset updated
    Jul 12, 2022
    Authors
    Darien Schettler
    Description

    Universal Embedding Challenge baseline model implementation.

    This folder contains the baseline model implementation for the Kaggle universal image embedding challenge based on

    Following the above ideas, we also add a 64-dimensional projection layer on top of the Vision Transformer base model as the final embedding, since the competition requires embeddings of at most 64 dimensions. Please find more details in image_classification.py.

    To use the code, please first install the prerequisites:
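To illustrate what such a projection does, here is a framework-free sketch of a dense layer mapping a backbone feature vector to 64 dimensions. It is not the code in image_classification.py: the weights are random, the helper names are hypothetical, and 768 is the usual hidden size of a ViT-Base model, assumed here as the input width.

```python
import math
import random

VIT_BASE_DIM = 768  # assumed backbone output width (standard for ViT-Base)
EMBED_DIM = 64      # competition cap on embedding size


def make_projection(in_dim, out_dim, seed=0):
    """Create a randomly initialised weight matrix and zero bias for a dense layer."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def project(features, weights, bias):
    """Apply the dense layer: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


weights, bias = make_projection(VIT_BASE_DIM, EMBED_DIM)
rng = random.Random(1)
backbone_output = [rng.gauss(0, 1) for _ in range(VIT_BASE_DIM)]  # stand-in features
embedding = project(backbone_output, weights, bias)
print(len(embedding))
```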

    pip install -r universal_embedding_challenge/requirements.txt
    
    git clone https://github.com/tensorflow/models.git /tmp/models
    export PYTHONPATH=$PYTHONPATH:/tmp/models
    pip install --user -r /tmp/models/official/requirements.txt
    

    Secondly, please download the imagenet1k data in TFRecord format from https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-0 and https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-1, and merge them together under folder imagenet-2012-tfrecord/. As a result, the paths to the training datasets and the validation datasets should be imagenet-2012-tfrecord/train* and imagenet-2012-tfrecord/validation*, respectively.

    The trainer for the model is implemented in train.py, and the following example launches the training

    python -m universal_embedding_challenge.train \
     --experiment=vit_with_bottleneck_imagenet_pretrain \
     --mode=train_and_eval \
     --model_dir=/tmp/imagenet1k_test
    

    The trained model checkpoints could be further converted to savedModel format using export_saved_model.py for Kaggle submission.

    The code to compute metrics for Universal Embedding Challenge is implemented in metrics.py and the code to read the solution file is implemented in read_retrieval_solution.py.

  7. Github Organizations - Social Network Analysis

    • kaggle.com
    zip
    Updated Jul 19, 2022
    Cite
    Anshul Mehta (2022). Github Organizations - Social Network Analysis [Dataset]. https://www.kaggle.com/datasets/anshulmehtakaggl/github-organizations-social-network-analysis
    Explore at:
    zip(79175 bytes)Available download formats
    Dataset updated
    Jul 19, 2022
    Authors
    Anshul Mehta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About the Dataset

    A dataset containing the list of top organizations on Github and their contributors.

    Tasks

    Primarily intended for social network analysis. Check out the Starter Notebook for reference.

  8. Awesome Dataset Repository on GitHub

    • kaggle.com
    zip
    Updated Jun 10, 2020
    Cite
    Rajesh Kumar Pandey (2020). Awesome Dataset Repository on GitHub [Dataset]. https://www.kaggle.com/datasets/rajeshkpandey/awesome-dataset-repository-on-github
    Explore at:
    zip(48522 bytes)Available download formats
    Dataset updated
    Jun 10, 2020
    Authors
    Rajesh Kumar Pandey
    Description

    Dataset

    This dataset was created by Rajesh Kumar Pandey

    Contents

  9. Learning Path Index Dataset

    • kaggle.com
    zip
    Updated Nov 6, 2024
    Cite
    Mani Sarkar (2024). Learning Path Index Dataset [Dataset]. https://www.kaggle.com/datasets/neomatrix369/learning-path-index-dataset/code
    Explore at:
    zip(151846 bytes)Available download formats
    Dataset updated
    Nov 6, 2024
    Authors
    Mani Sarkar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Learning Path Index Dataset is a comprehensive collection of byte-sized courses and learning materials tailored for individuals eager to delve into the fields of Data Science, Machine Learning, and Artificial Intelligence (AI), making it an indispensable reference for students, professionals, and educators in the Data Science and AI communities.

    This Kaggle Dataset along with the KaggleX Learning Path Index GitHub Repo were created by the mentors and mentees of Cohort 3 KaggleX BIPOC Mentorship Program (between August 2023 and November 2023, also see this). See Credits section at the bottom of the long description.

    Inspiration

    This dataset was created out of a commitment to facilitate learning and growth within the Data Science, Machine Learning, and AI communities. It started off as an idea at the end of Cohort 2 of the KaggleX BIPOC Mentorship Program brainstorming and feedback session. It was one of the ideas to create byte-sized learning material to help our KaggleX mentees learn things faster. It aspires to simplify the process of finding, evaluating, and selecting the most fitting educational resources.

    Context

    This dataset was meticulously curated to assist learners in navigating the vast landscape of Data Science, Machine Learning, and AI education. It serves as a compass for those aiming to develop their skills and expertise in these rapidly evolving fields.

    The mentors and mentees communicated via Discord, Trello, Google Hangout, etc. to put together these artifacts, and made them public for everyone to use and contribute back to.

    Sources

    The dataset compiles data from a curated selection of reputable sources including leading educational platforms such as Google Developer, Google Cloud Skill Boost, IBM, Fast AI, etc. By drawing from these trusted sources, we ensure that the data is both accurate and pertinent. The raw data and other artifacts as a result of this exercise can be found on the GitHub Repo i.e. KaggleX Learning Path Index GitHub Repo.

    Content

    The dataset encompasses the following attributes:

    • Course / Learning Material: The title of the Data Science, Machine Learning, or AI course or learning material.
    • Source: The provider or institution offering the course.
    • Course Level: The proficiency level, ranging from Beginner to Advanced.
    • Type (Free or Paid): Indicates whether the course is available for free or requires payment.
    • Module: Specific module or section within the course.
    • Duration: The estimated time required to complete the module or course.
    • Module / Sub-module Difficulty Level: The complexity level of the module or sub-module.
    • Keywords / Tags / Skills / Interests / Categories: Relevant keywords, tags, or categories associated with the course with a focus on Data Science, Machine Learning, and AI.
    • Links: Hyperlinks to access the course or learning material directly.

    How to contribute to this initiative?

    • You can also join us by taking part in the next KaggleX BIPOC Mentorship program (also see this)
    • Keep your eyes open on the Kaggle Discussions page and other KaggleX social media channels. Or find us on the Kaggle Discord channel to learn more about the next steps
    • Create notebooks from this data
    • Create supplementary or complementary data for or from this dataset
    • Submit corrections/enhancements or anything else to help improve this dataset so it has a wider use and purpose

    License

    The Learning Path Index Dataset is openly shared under a permissive license, allowing users to utilize the data for educational, analytical, and research purposes within the Data Science, Machine Learning, and AI domains. Feel free to fork the dataset and make it your own; we would be delighted if you contributed back to the dataset and/or our KaggleX Learning Path Index GitHub Repo as well.

    Important Links

    Credits

    Credits for all the work done to create this Kaggle Dataset and the KaggleX [Learnin...

  10. Real Indian users on Github

    • kaggle.com
    zip
    Updated Oct 6, 2024
    Cite
    Archit Tyagi (2024). Real Indian users on Github [Dataset]. https://www.kaggle.com/datasets/archittyagi108/real-indian-users-on-github/data
    Explore at:
    zip(610496 bytes)Available download formats
    Dataset updated
    Oct 6, 2024
    Authors
    Archit Tyagi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    📊 GitHub Indian Users Dataset

    Overview

    This dataset provides insights into the Indian developer community on GitHub, one of the world’s largest platforms for developers to collaborate, share, and contribute to open-source projects. Whether you're interested in analyzing trends, understanding community growth, or identifying popular programming languages, this dataset offers a comprehensive look at the profiles of GitHub users from India.

    🧑‍💻 Dataset Contents

    The dataset includes anonymized profile information for a diverse range of GitHub users based in India. Key features include:

    • Username: Unique identifier for each user (anonymized)
    • Location: City or region within India
    • Programming Languages: Most commonly used languages per user
    • Repositories: Public repositories owned and contributed to
    • Followers and Following: Social network connections within the platform
    • GitHub Join Date: Date the user joined GitHub
    • Organizations: Affiliated organizations (if publicly available)

    🌟 Source and Inspiration

    This dataset is curated from publicly available GitHub profiles with a specific focus on Indian users. It is inspired by the need to understand the growth of the tech ecosystem in India, including the languages, tools, and topics that are currently popular among Indian developers. This dataset aims to provide valuable insights for recruiters, data scientists, and anyone interested in the open-source contributions of Indian developers.

    Potential Use Cases

    1. Trend Analysis: Identify popular programming languages, tech stacks, and frameworks among Indian developers.
    2. Community Growth: Analyze how the Indian developer community has grown over time on GitHub.
    3. Social Network Analysis: Understand the follower and following patterns to uncover influential developers within the Indian tech community.
    4. Regional Insights: Discover which cities or regions in India have the most active GitHub users.
    5. Career Development: Insights for recruiters looking to identify and understand potential talent pools in India.

    💡 Ideal for

    This dataset is perfect for:

    • Data scientists looking to explore and visualize developer trends
    • Recruiters interested in talent scouting within the Indian tech ecosystem
    • Tech enthusiasts who want to explore the dynamics of India's open-source community
    • Students and educators looking for real-world data to practice analysis and modeling

  11. Cat_VS_Dog_Model

    • kaggle.com
    zip
    Updated Nov 15, 2021
    Cite
    Takashi Tamura (2021). Cat_VS_Dog_Model [Dataset]. https://www.kaggle.com/ttkagglett/cat-vs-dog-model
    Explore at:
    zip(30113811 bytes)Available download formats
    Dataset updated
    Nov 15, 2021
    Authors
    Takashi Tamura
    Description

    This dataset is from the GitHub repo below. You can use a model trained on the Dogs vs. Cats dataset from Kaggle.

    • GitHub: https://github.com/amitrajitbose/cat-v-dog-classifier-pytorch
    • Kaggle Competition: https://www.kaggle.com/c/dogs-vs-cats/data

  12. GitHub Repository Metadata

    • kaggle.com
    zip
    Updated Apr 16, 2025
    Cite
    IsaacOresanya (2025). GitHub Repository Metadata [Dataset]. https://www.kaggle.com/datasets/isaacoresanya/github-repository-metadata
    Explore at:
    zip(4909486 bytes)Available download formats
    Dataset updated
    Apr 16, 2025
    Authors
    IsaacOresanya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset captures the metadata of 14,000+ repositories across GitHub. You’ll find everything from stars and forks to health scores, README previews, and language breakdowns.

    It’s ideal for:

    • Identifying repo trends over time
    • Comparing popular vs. low-engagement projects
    • Exploring what makes a repo “healthy”

    Perfect for learning data cleaning, analysis, and visualization using real-world project metadata.

  13. GitHub Social Network

    • kaggle.com
    Updated Jan 12, 2023
    Cite
    Gitanjali Wadhwa (2023). GitHub Social Network [Dataset]. https://www.kaggle.com/datasets/gitanjali1425/github-social-network-graph-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Gitanjali Wadhwa
    Description

    An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on the location, repositories starred, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether the GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.

    Properties

    • Directed: No.
    • Node features: Yes.
    • Edge features: No.
    • Node labels: Yes. Binary-labeled.
    • Temporal: No.
    • Nodes: 37,700
    • Edges: 289,003
    • Density: 0.001
    • Transitivity: 0.013
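For readers who want to recompute the density and transitivity figures listed above, here is a small standard-library sketch of the two usual definitions for an undirected graph. The toy adjacency map is made up. With the listed node and edge counts the density formula gives roughly 0.0004, so the 0.001 in the list is presumably a coarser rounding.

```python
def undirected_density(nodes: int, edges: int) -> float:
    """Fraction of possible undirected node pairs that are connected."""
    possible = nodes * (nodes - 1) / 2
    return edges / possible


def transitivity(adjacency: dict) -> float:
    """Global clustering: closed triples / all connected triples (3 * triangles / triples)."""
    closed = 0   # neighbour pairs of a node that are themselves linked
    triples = 0  # all neighbour pairs of a node
    for nbrs in adjacency.values():
        nbrs = list(nbrs)
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for i in range(k):
            for j in range(i + 1, k):
                if nbrs[j] in adjacency[nbrs[i]]:
                    closed += 1
    return closed / triples if triples else 0.0


# Toy graph: a triangle a-b-c with one extra node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(undirected_density(37_700, 289_003))
print(transitivity(toy))
```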

    Possible Tasks

    • Binary node classification
    • Link prediction
    • Community detection
    • Network visualisation
  14. Open-Source GitHub Repos: Stars, Issues & PRs

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Mohammed Mebarek Mecheter (2024). Open-Source GitHub Repos: Stars, Issues & PRs [Dataset]. https://www.kaggle.com/datasets/mohammedmecheter/open-source-github-repos-stars-issues-and-prs
    Explore at:
    zip(17491462 bytes)Available download formats
    Dataset updated
    Sep 6, 2024
    Authors
    Mohammed Mebarek Mecheter
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Introduction to the Data and Fetching Process

    This dataset comprises detailed information about GitHub repositories, issues, and pull requests, collected using the GitHub API. The data includes repository metadata (such as stars, forks, and open issues), along with historical data on issues and pull requests (PRs), including their creation, closure, and merging timelines.

    Repositories Data Dictionary

    This dataset contains information about GitHub repositories, including metadata such as stars, forks, and activity status.

    Column Name | Data Type | Description
    id | object | Unique identifier for the repository.
    name | object | Name of the repository (e.g., "docker").
    full_name | object | Full name of the repository (e.g., "prometheus/alertmanager").
    description | object | Description of the repository, may be empty.
    stars | int64 | Number of stars the repository has.
    forks | int64 | Number of times the repository has been forked.
    open_issues | int64 | Number of open issues in the repository.
    created_at | datetime | Date and time when the repository was created.
    updated_at | datetime | Date and time when the repository was last updated.
    size_category | object | Categorization of the repository based on the number of stars (micro, small, medium, large, mega).
    stale | bool | Boolean flag indicating if the repository is "stale" (hasn't been updated in over 6 months).
    stars_per_fork | float64 | Number of stars per fork (calculated).
    stars_per_issue | float64 | Number of stars per open issue (calculated).
    contributor_per_star | float64 | Number of contributors per star (calculated).
    total_contributors | int64 | Total number of contributors from issues and pull requests.
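The calculated columns in this dictionary can be reproduced roughly as follows. This is a sketch under stated assumptions, not the dataset author's code: "over 6 months" is read as more than 182 days, division by zero is mapped to 0.0, and the sample repository values are made up. The star thresholds for size_category are not given in the description, so that column is left out.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=182)  # assumed reading of "over 6 months"


def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def derive_repo_fields(repo: dict, now: datetime) -> dict:
    """Compute the calculated columns from the raw repository fields."""
    return {
        "stale": now - repo["updated_at"] > STALE_AFTER,
        "stars_per_fork": safe_ratio(repo["stars"], repo["forks"]),
        "stars_per_issue": safe_ratio(repo["stars"], repo["open_issues"]),
        "contributor_per_star": safe_ratio(repo["total_contributors"], repo["stars"]),
    }


repo = {  # made-up values for illustration
    "stars": 1200, "forks": 300, "open_issues": 40, "total_contributors": 60,
    "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
}
derived = derive_repo_fields(repo, now=datetime(2024, 9, 6, tzinfo=timezone.utc))
print(derived)
```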

    Issues Data Dictionary

    This dataset contains details of issues raised in the repositories, including information about their creation, closing, and state.

    Column Name | Data Type | Description
    id | object | Unique identifier for the issue.
    created_at | datetime | Date and time when the issue was created.
    updated_at | datetime | Date and time when the issue was last updated.
    closed_at | datetime | Date and time when the issue was closed (optional, null if open).
    number | int64 | Issue number in the GitHub repository.
    repository | object | The repository that the issue belongs to (name).
    state | object | Current state of the issue (either "open" or "closed").
    title | object | Title of the issue.
    resolution_time_days | float64 | Number of days taken to resolve the issue (calculated, -1 for unresolved issues).
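A short sketch of how resolution_time_days could be derived from created_at and closed_at, using -1 as the sentinel for unresolved issues as the dictionary states. The ISO 8601 timestamp format with a trailing "Z" is an assumption based on how the GitHub API usually reports times, and the helper names are hypothetical.

```python
from datetime import datetime
from typing import Optional

UNRESOLVED = -1.0  # sentinel named in the data dictionary


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp such as '2024-03-01T12:00:00Z' (assumed format)."""
    if not value:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def resolution_time_days(created_at: str, closed_at: Optional[str]) -> float:
    """Days between creation and closure, or -1 if the issue is still open."""
    created, closed = parse_ts(created_at), parse_ts(closed_at)
    if closed is None:
        return UNRESOLVED
    return (closed - created).total_seconds() / 86400


print(resolution_time_days("2024-03-01T00:00:00Z", "2024-03-02T12:00:00Z"))
print(resolution_time_days("2024-03-01T00:00:00Z", None))
```

When averaging this column, filter out the -1 rows first, since the sentinel would otherwise pull the mean down.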

    Pull Requests Data Dictionary

    This dataset contains information about pull requests (PRs) in the repositories, including metadata such as their state, creation, closing, and merging time.

    Column Name | Data Type | Description
    id | object | Unique identifier for the pull request.
    created_at | datetime | Date and time when the pull request was created.
    updated_at | datetime | Date and time when the pull request was last updated.
    closed_at | datetime | Date and time when the pull request was closed (optional, null if open).
    merged_at | datetime | Date and time when the pull request was merged (optional, null if not merge...
  15. Programming Languages

    • kaggle.com
    zip
    Updated Sep 16, 2023
    Cite
    Sujay Kapadnis (2023). Programming Languages [Dataset]. https://www.kaggle.com/datasets/sujaykapadnis/programming-languages
    Explore at:
    zip(879324 bytes)Available download formats
    Dataset updated
    Sep 16, 2023
    Authors
    Sujay Kapadnis
    Description

    The Dataset comes from Programming Languages Database

    languages.csv

    The full data dictionary is available from PLDB.com.

    variable | class | description
    pldb_id | character | A standardized, uniquified version of the language name, used as an ID on the PLDB site.
    title | character | The official title of the language.
    description | character | Description of the repo on GitHub.
    type | character | Which category in PLDB's subjective ontology does this entity fit into.
    appeared | double | What year was the language publicly released and/or announced?
    creators | character | Name(s) of the original creators of the language, delimited by " and ".
    website | character | URL of the official homepage for the language project.
    domain_name | character | If the project website is on its own domain.
    domain_name_registered | double | When was this domain first registered?
    reference | character | A link to more info about this entity.
    isbndb | double | Books about this language from ISBNdb.
    book_count | double | Computed; the number of books found for this language at isbndb.com.
    semantic_scholar | integer | Papers about this language from Semantic Scholar.
    language_rank | double | Computed; a rank for the language, taking into account various online rankings. The computation for this column is not currently clear.
    github_repo | character | URL of the official GitHub repo for the project, if it is hosted there.
    github_repo_stars | double | How many stars does the repo have?
    github_repo_forks | double | How many forks does the repo have?
    github_repo_updated | double | What year was the last commit made?
    github_repo_subscribers | double | How many subscribers does the repo have?
    github_repo_created | double | When was the GitHub repo for this entity created?
    github_repo_description | character | Description of the repo on GitHub.
    github_repo_issues | double | How many issues are on the repo?
    github_repo_first_commit | double | What year was the first commit made in this git repo?
    github_language | character | GitHub has a set of supported languages as defined here
    github_language_tm_scope | character | The TextMate scope that represents this programming language.
    github_language_type | character | Either data, programming, markup, prose, or nil.
    github_language_ace_mode | character | A String name of the Ace Mode used for highlighting whenever a file is edited. This must match one of the filenames in http://git.io/3XO_Cg. Use "text" if a mode does not exist.
    github_language_file_extensions | character | An Array of associated extensions (the first one is considered the primary extension, the others should be listed alphabetically).
    github_language_repos | double | How many repos for this language does GitHub report?
    wikipedia | character | URL of the entity on Wikipedia, if and only if it has a page dedicated to it.
    wikipedia_daily_page_views | double | How many page views per day does this Wikipedia page get? Useful as a signal for rankings. Available via WP api.
    wikipedia_backlinks_count | double | How many pages on WP link to this page?
    wikipedia_summary | character | What is the text summary of the language from the Wikipedia page?
    wikipedia_page_id | double | What is the internal ID for this entity on WP?
    wikipedia_appeared | double | When does Wikipedia claim this entity first appeared?
    wikipedia_created | double | When was the Wikipedia page for this entity created?
    wikipedia_revision_count | double | How many revisions does this page have?
    wikipedia_related | character | What languages does Wikipedia have as related?
    features_has_comments | logical | Does this language have a comment character?
    features_has_semantic_indentation | logical | Does indentation have semantic meaning in this language?
    features_has_line_comments | logical | Does this language support inline comments (as opposed to comments that must span an entire line)?
    line_comment_token | character | ...
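
    As a rough illustration of how the languages.csv columns described above might be used, the sketch below ranks languages by github_repo_stars with Python's standard csv module. The three sample rows and the helper name top_by_stars are hypothetical; only the column names come from the data dictionary, and how missing values are encoded in the real file is an assumption.

```python
import csv
import io

# Hypothetical excerpt standing in for languages.csv. The values are invented
# for illustration and only a few of the documented columns are shown.
SAMPLE = """pldb_id,title,appeared,github_repo_stars
alpha,Alpha,1995,1200
beta,Beta,2009,
gamma,Gamma,2015,56000
"""


def top_by_stars(rows, n=2):
    """Return the n rows with the most GitHub stars, skipping missing values."""
    # Assumption: missing star counts appear as an empty field or "NA".
    starred = [r for r in rows if r["github_repo_stars"] not in ("", "NA")]
    return sorted(starred, key=lambda r: float(r["github_repo_stars"]), reverse=True)[:n]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
top = top_by_stars(rows)
for r in top:
    print(r["title"], r["appeared"], r["github_repo_stars"])
```

    To run this against the real file, replace io.StringIO(SAMPLE) with an open file handle for languages.csv.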
  16. spinning-up-sac-ant-results

    • kaggle.com
    zip
    Updated Jul 20, 2025
    Cite
    MoniGarrr (2025). spinning-up-sac-ant-results [Dataset]. https://www.kaggle.com/datasets/monigarrr/spinning-up-sac-ant-results
    Explore at:
    Available download formats: zip (2209 bytes)
    Dataset updated
    Jul 20, 2025
    Authors
    MoniGarrr
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    SAC Ant Results: OpenAI Spinning Up (Modernized PPO/SAC Baseline)
    Repository: github.com/monigarr/spinningup/tree/monigarr-dev

    This dataset contains structured results from reinforcement learning (RL) experiments using the Soft Actor-Critic (SAC) algorithm on the Ant-v5 environment. It is part of MoniGarr's initiative to modernize and extend the original OpenAI Spinning Up codebase for current Python, PyTorch, and Gymnasium ecosystems.

    The results include detailed logs of reward progression, hyperparameter configurations, evaluation summaries, and visualizations, all generated through reproducible experimentation using an updated and extensible RL workflow.

    DATASET CONTENTS:
    File Name | Description
    sac_ant_results.csv | Epoch-level log of training rewards, timesteps, and key metrics
    sac_config.json | Full configuration used for the SAC training run
    sac_eval_metrics.json | Summary of evaluation metrics including reward and return
    sac_training_plot.png | Reward curve visualization (training performance over time)
    experiment_notes.md | Key observations and tuning notes from the experiments
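
    A minimal sketch of how an epoch-level log such as sac_ant_results.csv could be summarized. The description does not list the file's columns, so the names "epoch" and "avg_return", the sample values, and the helper best_epoch are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical excerpt of an epoch-level training log. The real column names
# in sac_ant_results.csv are not documented above; "epoch" and "avg_return"
# are assumed, and the numbers are invented.
SAMPLE_LOG = """epoch,avg_return
1,-35.2
2,120.5
3,480.0
4,455.3
"""


def best_epoch(rows, metric="avg_return"):
    """Return (epoch, value) for the row with the highest metric value."""
    best = max(rows, key=lambda r: float(r[metric]))
    return int(best["epoch"]), float(best[metric])


rows = list(csv.DictReader(io.StringIO(SAMPLE_LOG)))
epoch, value = best_epoch(rows)
print(f"best epoch: {epoch}, return: {value}")
```

    Once the actual header row is known, pass the matching column name through the metric parameter.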

    METHODOLOGY
    • Cloned and refactored OpenAI's Spinning Up repo
    • Replaced deprecated gym with gymnasium
    • Updated SAC implementation for compatibility with PyTorch 2.x
    • Ran long-horizon training on Ant-v5 with multiple seeds and checkpoints
    • Used custom logging for exportable CSV/JSON format results

    INTENDED USES
    This dataset supports:
    • Baseline reproduction and RL benchmarking
    • Curriculum development in deep reinforcement learning
    • Comparative analysis of SAC vs. PPO/TD3
    • Applied research, debugging, and educational tutorials

    WHY THIS DATASET IS USEFUL
    Maintaining parity with evolving RL tools is essential for ensuring reproducibility and learning efficiency. This dataset:
    • Demonstrates SAC performance under modern configurations
    • Offers ready-to-use logs and plots for analysis and reporting
    • Enables faster experimentation for RL students and developers

    PROJECT CONTEXT
    This work is part of MoniGarr's larger suite of open-source AI efforts focused on:
    • Modernizing legacy ML frameworks
    • Promoting accessible, well-documented reinforcement learning pipelines
    • Supporting low-resource developers and researchers with reproducible tools

    GEOSPATIAL COVERAGE
    • Primary Location: Akwesasne NY, Akwesasne Ontario
    • Extended Context: Worldwide (open-source reproducibility)
    • The dataset was generated in Akwesasne, but it is intended for worldwide use in reproducible RL research and education. Since the data is synthetic and code-driven, no human-subject or location-bound data is involved.

    ASSOCIATED PAPERS & SOURCES
    This dataset builds upon and modernizes results from:

    SPINNING UP IN DEEP RL: OpenAI
    GitHub: https://github.com/openai/spinningup
    Paper: https://spinningup.openai.com/en/latest/spinningup.pdf

    SOFT ACTOR-CRITIC ALGORITHMS: Haarnoja et al., 2018
    Paper: https://arxiv.org/abs/1801.01290
    SAC Code Reference: https://github.com/denisyarats/pytorch_sac

    EXPECTED UPDATE FREQUENCY
    • Initial Release: Complete
    • Updates: Occasional, only if benchmark improvements, environment changes, or additional baseline comparisons (e.g., TD3, PPO-Penalty) are added.
    • Community Contributions: Welcome via GitHub PRs and issues.

    RECOMMENDED COVERAGE
    • Reinforcement Learning education and experimentation
    • Benchmarking reproducible SAC performance on Ant-v5
    • Use in papers, blogs, notebooks, or reproducibility studies
    • Modern RL code comparisons (Gym → Gymnasium, legacy → PyTorch 2.x)

    If you find the dataset helpful, feel free to ⭐️ the repo or connect with @MoniGarr. https://github.com/monigarr/spinningup/tree/monigarr-dev

  17. GitHub Code Snippets - Development sample

    • kaggle.com
    zip
    Updated Mar 7, 2021
    Cite
    zomglings (2021). GitHub Code Snippets - Development sample [Dataset]. https://www.kaggle.com/simiotic/github-code-snippets-development-sample
    Explore at:
    Available download formats: zip (494262965 bytes)
    Dataset updated
    Mar 7, 2021
    Authors
    zomglings
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is a sample of approximately 5% of the full GitHub Code Snippets dataset.

    The full dataset is over 60GB in size, and can be difficult to work with in Kaggle notebooks. We have released this development dataset for prototyping purposes so that you can test different ideas on a smaller dataset before running processing on the full dataset.

    Issues and requests

    This dataset is built and maintained by Bugout.dev. To report an issue with the data, or to request changes in future versions of the dataset, please open a discussion thread under the full dataset.

  18. Github User Analysis 2019 for Graph Dataset

    • kaggle.com
    zip
    Updated Sep 25, 2023
    Cite
    Christofel Ganteng (2023). Github User Analysis 2019 for Graph Dataset [Dataset]. https://www.kaggle.com/datasets/christofel04/github-user-analysis-2019-for-graph-dataset
    Explore at:
    Available download formats: zip (2447769 bytes)
    Dataset updated
    Sep 25, 2023
    Authors
    Christofel Ganteng
    Description

    [ GitHub User Analysis 2019 for Graph Dataset ]

    This is the GitHub User Analysis 2019 for Graph Dataset: a large social network of GitHub developers, collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted from the location, repositories starred, employer, and e-mail address. The task associated with the graph is binary node classification: predict whether a GitHub user is a web developer or a machine learning developer. This target feature was derived from the job title of each user.
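
    To make the graph structure and the classification task concrete, here is a small sketch that stores mutual-follower edges as an undirected adjacency map and predicts a node's label by majority vote among its labeled neighbors. The toy edges, labels, and function names are hypothetical and not taken from the dataset, and the neighbor vote is only a naive baseline, not the method used by the dataset's author.

```python
from collections import defaultdict

# Hypothetical toy graph: node ids, mutual-follower edges, and known labels
# (0 = web developer, 1 = machine learning developer).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
labels = {0: 1, 1: 1, 3: 0, 4: 0}  # node 2 is unlabeled


def build_adjacency(edge_list):
    """Store each mutual-follower edge in both directions."""
    adj = defaultdict(set)
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def predict_by_neighbors(node, adj, known):
    """Predict the majority label among labeled neighbors (ties go to 0)."""
    votes = [known[n] for n in adj[node] if n in known]
    if not votes:
        return None  # no labeled neighbors, so no prediction
    return int(sum(votes) * 2 > len(votes))


adj = build_adjacency(edges)
prediction = predict_by_neighbors(2, adj, labels)
print(prediction)
```

    A real model would also use the vertex features (location, starred repositories, employer, e-mail address) rather than neighbor labels alone.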

    Data Description :

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2941945%2F0297b539f7d955df091ebc19eee2d996%2FScreenshot%20from%202023-09-25%2016-30-37.png?generation=1695627254231053&alt=media

    GitHub User Analysis 2019 for Graph Dataset Tasks :

    1. Can you predict whether a GitHub user in 2019 is a Software Engineer or an AI Engineer, based on the GitHub User 2019 Analysis and the user's GitHub posts and tendencies?
    2. Can you predict whether a GitHub user in 2019 would follow an AI Researcher, based on the GitHub User 2019 Analysis and the user's GitHub posts and tendencies?
    3. Can you predict whether a GitHub user in 2019 would make good publications, based on the GitHub User 2019 Analysis and the user's GitHub posts and tendencies?

    Try to visualize the GitHub User 2019 Analysis and tendencies, and try to find patterns in them.

  19. GitHub Dataset

    • kaggle.com
    zip
    Updated Mar 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nikhil Raj (2023). GitHub Dataset [Dataset]. https://www.kaggle.com/nikhil25803/github-dataset
    Explore at:
    Available download formats: zip (79399228 bytes)
    Dataset updated
    Mar 2, 2023
    Authors
    Nikhil Raj
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Two versions of the dataset are available.

    Version 1 Link

    This dataset is a collection of 1052 GitHub repositories, with columns such as the primary language used, fork count, open pull requests, and issue count.

    While working on a repository recommendation project, I curated this data by scraping more than 18,000 repositories and keeping those that have at least one open issue, so that we can recommend a repository to which the user can contribute.

    Columns
    • repositories - the name of the repository (format: github_username/repository_name)
    • stars_count - star count of the repository
    • forks_count - fork count of the repository
    • issues_count - active/open issues in the repository
    • pull_requests - pull requests opened in the repository
    • contributors - contributors to the project so far
    • language - primary language used in the project
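
    The filtering step described above (keep only repositories with at least one open issue) could look roughly like the sketch below. It uses the Version 1 column names listed above; the sample rows and the function name with_open_issues are hypothetical, and this is not the author's actual scraping or filtering code.

```python
import csv
import io

# Hypothetical rows using the Version 1 column names; the values are invented.
SAMPLE = """repositories,stars_count,forks_count,issues_count,pull_requests,contributors,language
octo/alpha,120,14,3,1,5,Python
octo/beta,45,2,0,0,1,Go
octo/gamma,980,210,27,4,38,Rust
"""


def with_open_issues(rows):
    """Keep repositories that have at least one open issue."""
    return [r for r in rows if int(r["issues_count"]) >= 1]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
candidates = with_open_issues(rows)
print([r["repositories"] for r in candidates])
```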

    Version 2 Link

    I found JSON data on Kaggle (link) and wrote a preprocessing function to convert it into a CSV file.

    This is a comparatively bigger dataset, with data on 2917951 repositories.

    Columns
    • name - the name of the repository
    • stars_count - star count of the repository
    • forks_count - fork count of the repository
    • watchers - watchers of the repository
    • pull_requests - pull requests made in the repository
    • primary_language - the primary language of the repository
    • languages_used - list of all the languages used in the repository
    • commit_count - commits made in the repository
    • created_at - time and date when the repository was created
    • license - license assigned to the repository
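
    A JSON-to-CSV preprocessing step of the kind mentioned above might be sketched as follows. The structure of the source JSON is not described, so this assumes a list of objects whose keys already match the Version 2 column names; the sample record, the ";" separator for languages_used, and the function name records_to_csv are all assumptions rather than the author's implementation.

```python
import csv
import io
import json

# Version 2 column names, as listed in the dataset description.
COLUMNS = ["name", "stars_count", "forks_count", "watchers", "pull_requests",
           "primary_language", "languages_used", "commit_count", "created_at",
           "license"]

# Hypothetical input; the real JSON layout is not documented above.
RAW_JSON = """
[
  {"name": "alpha", "stars_count": 10, "forks_count": 2, "watchers": 3,
   "pull_requests": 1, "primary_language": "Python",
   "languages_used": ["Python", "Shell"], "commit_count": 42,
   "created_at": "2020-01-01T00:00:00Z", "license": null}
]
"""


def records_to_csv(records):
    """Write one CSV row per JSON record, flattening the language list."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        row = {col: rec.get(col) for col in COLUMNS}
        # Assumption: join the list with ";" so it fits in a single CSV cell.
        row["languages_used"] = ";".join(rec.get("languages_used") or [])
        writer.writerow(row)  # None values are written as empty fields
    return out.getvalue()


csv_text = records_to_csv(json.loads(RAW_JSON))
print(csv_text)
```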

    Note: The data reflects the time when it was scraped, so any later updates to the actual repositories are not reflected here.

  20. AI vs. Human Code: A Comparative Dataset

    • kaggle.com
    zip
    Updated Feb 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Parul Jain (2025). AI vs. Human Code: A Comparative Dataset [Dataset]. https://www.kaggle.com/datasets/paruljain1024pp/ai-vs-human-code-a-comparative-dataset
    Explore at:
    Available download formats: zip (96401 bytes)
    Dataset updated
    Feb 11, 2025
    Authors
    Parul Jain
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    With the rise of AI-assisted coding tools like GitHub Copilot, ChatGPT, and Codeium, the debate between AI-generated vs. human-written code has gained momentum. This dataset provides a structured comparison of 5,000 code snippets across multiple programming languages and problem domains.

Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos

GitHub Repos

Code and comments from 2.8 million repos
