GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and to learn how to safely analyze large BigQuery datasets.
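For illustration, a minimal sketch of such a query follows; it assumes the google-cloud-bigquery package is installed and Google Cloud credentials are configured, and the commit-count query itself is illustrative rather than part of the dataset documentation:

```python
# Sketch: top repositories by commit count in the public GitHub dataset.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT repo, COUNT(*) AS n_commits
    FROM `bigquery-public-data.github_repos.commits`,
         UNNEST(repo_name) AS repo
    GROUP BY repo
    ORDER BY n_commits DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.repo, row.n_commits)
```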
This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Abhishek Ranjan
Released under Database: Open Database License (ODbL); Contents: Database Contents License (DbCL)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A common question for those new to, and familiar with, computer science and software engineering is: what is the best and/or most popular programming language? It is very difficult to give a definitive answer, as there is a seemingly endless number of metrics that could define the 'best' or 'most popular' programming language.
One such metric is the number of projects and files that are made using that programming language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages used for repositories, PRs, and issues on GitHub can be a good indicator of a language's popularity.
This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.
This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.
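As a rough illustration (not the exact queries behind this dataset), an aggregation over the public github_repos tables might look like the following; table and column names follow BigQuery's published schema, and credentials are assumed to be configured:

```python
# Illustrative sketch: count repositories per language using the public
# github_repos.languages table (repo_name plus repeated language records).
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT lang.name AS language, COUNT(DISTINCT l.repo_name) AS repo_count
    FROM `bigquery-public-data.github_repos.languages` AS l,
         UNNEST(l.language) AS lang
    GROUP BY language
    ORDER BY repo_count DESC
    LIMIT 20
"""
for row in client.query(query).result():
    print(f"{row.language}: {row.repo_count}")
```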
Only data for public GitHub repositories, and their corresponding PRs/issues, have their data available publicly. Thus, this dataset is only based on public repositories, which may not be fully representative of all repositories on GitHub.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is obtained from the Github API and contains only public repository-level metadata. It may be useful for anyone interested in studying the Github ecosystem. It contains approximately 3.1 million entries.
The Github API Terms of Service apply.
You may not use this dataset for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.
Please see the sample exploration notebook for some examples of what you can do! The data format is a JSON array of entries, an example of which is given below.
{
"owner": "pelmers",
"name": "text-rewriter",
"stars": 13,
"forks": 5,
"watchers": 4,
"isFork": false,
"isArchived": false,
"languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
"languageCount": 3,
"topics": [ { "name": "chrome-extension", "stars": 43211 } ],
"topicCount": 1,
"diskUsageKb": 75,
"pullRequests": 4,
"issues": 12,
"description": "Webextension to rewrite phrases in pages",
"primaryLanguage": "JavaScript",
"createdAt": "2015-03-14T22:35:11Z",
"pushedAt": "2022-02-11T14:26:00Z",
"defaultBranchCommitCount": 54,
"license": null,
"assignableUserCount": 1,
"codeOfConduct": null,
"forkingAllowed": true,
"nameWithOwner": "pelmers/text-rewriter",
"parent": null
}
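As a minimal sketch of working with this format (the file name is an assumption; use the path of your download):

```python
# Load the JSON array of repository entries and compute a few aggregates.
import json
from collections import Counter

with open("repo_metadata.json", encoding="utf-8") as f:
    repos = json.load(f)  # a JSON array of entry objects, as shown above

print(len(repos), "repositories")
top_langs = Counter(
    r.get("primaryLanguage") for r in repos if r.get("primaryLanguage")
)
print(top_langs.most_common(5))
```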
The collection script and exploration notebook are also available on Github: https://github.com/pelmers/github-repository-metadata. For more background info, you can read my blog post.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Here is a description of how the datasets for a training notebook used in a Telegram ML Contest solution were prepared.
The first part of the code samples was taken from a private version of this notebook.
Here are the statistics for the classes of programming languages from the GitHub Code Snippets database:
[Chart: class distribution of programming languages in the GitHub Code Snippets database]
From this database, two CSV files were created, with 50,000 code samples for each of the 20 included programming languages: one with equal per-class counts and one with stratified sampling. The files here are sample_equal_prop_50000.csv and sample_stratified_50000.csv, respectively.
A second option for capturing additional examples was to run this notebook with a larger number of queries, 10,000.
The resulting file is dataset-10000.csv, included in the data card.
The statistics for its programming languages are shown in the next chart; there are 32 labeled classes.
[Chart: class distribution across the 32 labeled classes in dataset-10000.csv]
To make the model more robust, code samples for 20 additional languages were collected, roughly 10 to 15 samples each, covering more or less popular use cases. Also, for the class "OTHER" (natural-language examples, per the task of the competition), text examples from this dataset of prompts on Hugging Face were added to the file. The resulting file here is rare_languages.csv, also in the data card.
The statistics for the rare-language code snippets are as follows:
[Chart: class distribution of the rare-language code snippets]
For this stage of dataset creation, the columns in sample_equal_prop_50000.csv and sample_stratified_50000.csv were cut down to just two, "snippet" and "language"; the equal-counts version of the file is in the data card as sample_equal_prop_50000_clean.csv.
To prepare the BigQuery dataset file, the index column was dropped and the column "content" was renamed to "snippet". These changes were saved in dataset-10000-clean.csv.
After that, the files sample_equal_prop_50000_clean.csv and dataset-10000-clean.csv were combined and saved as github-combined-file.csv.
The prepared files took too much RAM to be read with the pandas library, so additional preprocessing was done: characters such as quotes, commas, ampersands, newlines, and tabs were cleaned out. After cleaning, the files were merged with the rare_languages.csv file and saved as github-combined-file-no-symbols-rare-clean.csv and sample_equal_prop_50000_-no-symbols-rare-clean.csv, respectively.
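A hedged pandas sketch combining the column cleanup and symbol stripping described above (the exact character set removed is an assumption):

```python
import pandas as pd

# Drop the index column and rename "content" to "snippet".
df = pd.read_csv("dataset-10000.csv", index_col=0)
df = df.rename(columns={"content": "snippet"})

# Strip characters that made the merged files too heavy to read:
# quotes, commas, ampersands, newlines, and tabs.
df["snippet"] = (
    df["snippet"]
    .astype(str)
    .str.replace(r"[\"',&]", " ", regex=True)
    .str.replace(r"[\n\t]+", " ", regex=True)
)
df.to_csv("dataset-10000-clean.csv", index=False)
```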
The final distribution of classes turned out as follows:
[Chart: final class distribution of the combined dataset]
To be suitable for the TF-DF format, each programming language was also assigned a numeric label. The final labels are in the data card.
This folder contains the baseline model implementation for the Kaggle universal image embedding challenge based on
Following the above ideas, we also add a 64-dimensional projection layer on top of the Vision Transformer base model as the final embedding, since the competition requires embeddings of at most 64 dimensions. Please find more details in image_classification.py.
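As a rough sketch of that idea (names are illustrative, the actual implementation is in image_classification.py, and the L2 normalization here is an assumption):

```python
# Sketch: a 64-d projection head on a ViT backbone, since the competition
# caps embeddings at 64 dimensions.
import tensorflow as tf

def build_embedding_model(vit_backbone: tf.keras.Model) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = vit_backbone(inputs)                  # pooled ViT features
    embedding = tf.keras.layers.Dense(64)(features)  # 64-d projection
    embedding = tf.keras.layers.Lambda(
        lambda x: tf.math.l2_normalize(x, axis=-1)   # unit-norm embedding
    )(embedding)
    return tf.keras.Model(inputs, embedding)
```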
To use the code, please first install the prerequisites:
pip install -r universal_embedding_challenge/requirements.txt
git clone https://github.com/tensorflow/models.git /tmp/models
export PYTHONPATH=$PYTHONPATH:/tmp/models
pip install --user -r /tmp/models/official/requirements.txt
Secondly, please download the imagenet1k data in TFRecord format from https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-0 and https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-1, and merge them together under folder imagenet-2012-tfrecord/. As a result, the paths to the training datasets and the validation datasets should be imagenet-2012-tfrecord/train* and imagenet-2012-tfrecord/validation*, respectively.
The trainer for the model is implemented in train.py, and the following example launches the training
python -m universal_embedding_challenge.train \
--experiment=vit_with_bottleneck_imagenet_pretrain \
--mode=train_and_eval \
--model_dir=/tmp/imagenet1k_test
The trained model checkpoints can be further converted to SavedModel format using export_saved_model.py for Kaggle submission.
The code to compute metrics for Universal Embedding Challenge is implemented in metrics.py and the code to read the solution file is implemented in read_retrieval_solution.py.
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
A dataset containing the list of top organizations on GitHub and their contributors.
Primarily intended for social network analysis. Check out the starter notebook for reference.
This dataset was created by Rajesh Kumar Pandey
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Learning Path Index Dataset is a comprehensive collection of byte-sized courses and learning materials tailored for individuals eager to delve into the fields of Data Science, Machine Learning, and Artificial Intelligence (AI), making it an indispensable reference for students, professionals, and educators in the Data Science and AI communities.
This Kaggle Dataset, along with the KaggleX Learning Path Index GitHub Repo, was created by the mentors and mentees of Cohort 3 of the KaggleX BIPOC Mentorship Program (between August 2023 and November 2023; also see this). See the Credits section at the bottom of the long description.
This dataset was created out of a commitment to facilitate learning and growth within the Data Science, Machine Learning, and AI communities. It started off as an idea during the brainstorming and feedback session at the end of Cohort 2 of the KaggleX BIPOC Mentorship Program: to create byte-sized learning material to help our KaggleX mentees learn things faster. It aspires to simplify the process of finding, evaluating, and selecting the most fitting educational resources.
This dataset was meticulously curated to assist learners in navigating the vast landscape of Data Science, Machine Learning, and AI education. It serves as a compass for those aiming to develop their skills and expertise in these rapidly evolving fields.
The mentors and mentees communicated via Discord, Trello, Google Hangouts, and other channels to put together these artifacts, and made them public for everyone to use and contribute back to.
The dataset compiles data from a curated selection of reputable sources including leading educational platforms such as Google Developer, Google Cloud Skill Boost, IBM, Fast AI, etc. By drawing from these trusted sources, we ensure that the data is both accurate and pertinent. The raw data and other artifacts as a result of this exercise can be found on the GitHub Repo i.e. KaggleX Learning Path Index GitHub Repo.
The dataset encompasses the following attributes:
The Learning Path Index Dataset is openly shared under a permissive license, allowing users to utilize the data for educational, analytical, and research purposes within the Data Science, Machine Learning, and AI domains. Feel free to fork the dataset and make it your own, we would be delighted if you contributed back to the dataset and/or our KaggleX Learning Path Index GitHub Repo as well.
Credits for all the work done to create this Kaggle Dataset and the KaggleX Learning Path Index GitHub Repo…
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides insights into the Indian developer community on GitHub, one of the world’s largest platforms for developers to collaborate, share, and contribute to open-source projects. Whether you're interested in analyzing trends, understanding community growth, or identifying popular programming languages, this dataset offers a comprehensive look at the profiles of GitHub users from India.
The dataset includes anonymized profile information for a diverse range of GitHub users based in India. Key features include:
- Username: Unique identifier for each user (anonymized)
- Location: City or region within India
- Programming Languages: Most commonly used languages per user
- Repositories: Public repositories owned and contributed to
- Followers and Following: Social network connections within the platform
- GitHub Join Date: Date the user joined GitHub
- Organizations: Affiliated organizations (if publicly available)
This dataset is curated from publicly available GitHub profiles with a specific focus on Indian users. It is inspired by the need to understand the growth of the tech ecosystem in India, including the languages, tools, and topics that are currently popular among Indian developers. This dataset aims to provide valuable insights for recruiters, data scientists, and anyone interested in the open-source contributions of Indian developers.
This dataset is perfect for:
- Data scientists looking to explore and visualize developer trends
- Recruiters interested in talent scouting within the Indian tech ecosystem
- Tech enthusiasts who want to explore the dynamics of India's open-source community
- Students and educators looking for real-world data to practice analysis and modeling
This dataset is from the GitHub repo below. You can use a model trained with the Dogs vs. Cats dataset on Kaggle. GitHub: https://github.com/amitrajitbose/cat-v-dog-classifier-pytorch Kaggle Competition: https://www.kaggle.com/c/dogs-vs-cats/data
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset captures the metadata of 14,000+ repositories across GitHub. You’ll find everything from stars and forks to health scores, README previews, and language breakdowns.
It’s ideal for:
- Identifying repo trends over time
- Comparing popular vs. low-engagement projects
- Exploring what makes a repo “healthy”
Perfect for learning data cleaning, analysis, and visualization using real-world project metadata.
Description
An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on location, starred repositories, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether a GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.
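A hedged sketch of loading the graph (the file names follow this dataset's common distribution and are assumptions here):

```python
# Load the developer graph and inspect the two classes.
import pandas as pd
import networkx as nx

edges = pd.read_csv("musae_git_edges.csv")    # assumed columns: id_1, id_2
target = pd.read_csv("musae_git_target.csv")  # assumed columns: id, name, ml_target

G = nx.from_pandas_edgelist(edges, "id_1", "id_2")
print(G.number_of_nodes(), "developers,", G.number_of_edges(), "mutual follows")
print(target["ml_target"].value_counts())     # 1 = ML developer, 0 = web
```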
Properties
Possible Tasks
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset comprises detailed information about GitHub repositories, issues, and pull requests, collected using the GitHub API. The data includes repository metadata (such as stars, forks, and open issues), along with historical data on issues and pull requests (PRs), including their creation, closure, and merging timelines.
This dataset contains information about GitHub repositories, including metadata such as stars, forks, and activity status.
| Column Name | Data Type | Description |
|---|---|---|
| id | object | Unique identifier for the repository. |
| name | object | Name of the repository (e.g., "docker"). |
| full_name | object | Full name of the repository (e.g., "prometheus/alertmanager"). |
| description | object | Description of the repository, may be empty. |
| stars | int64 | Number of stars the repository has. |
| forks | int64 | Number of times the repository has been forked. |
| open_issues | int64 | Number of open issues in the repository. |
| created_at | datetime | Date and time when the repository was created. |
| updated_at | datetime | Date and time when the repository was last updated. |
| size_category | object | Categorization of the repository based on the number of stars (micro, small, medium, large, mega). |
| stale | bool | Boolean flag indicating if the repository is "stale" (hasn't been updated in over 6 months). |
| stars_per_fork | float64 | Number of stars per fork (calculated). |
| stars_per_issue | float64 | Number of stars per open issue (calculated). |
| contributor_per_star | float64 | Number of contributors per star (calculated). |
| total_contributors | int64 | Total number of contributors from issues and pull requests. |
This dataset contains details of issues raised in the repositories, including information about their creation, closing, and state.
| Column Name | Data Type | Description |
|---|---|---|
| id | object | Unique identifier for the issue. |
| created_at | datetime | Date and time when the issue was created. |
| updated_at | datetime | Date and time when the issue was last updated. |
| closed_at | datetime | Date and time when the issue was closed (optional, null if open). |
| number | int64 | Issue number in the GitHub repository. |
| repository | object | The repository that the issue belongs to (name). |
| state | object | Current state of the issue (either "open" or "closed"). |
| title | object | Title of the issue. |
| resolution_time_days | float64 | Number of days taken to resolve the issue (calculated, -1 for unresolved issues). |
This dataset contains information about pull requests (PRs) in the repositories, including metadata such as their state, creation, closing, and merging time.
| Column Name | Data Type | Description |
|---|---|---|
| id | object | Unique identifier for the pull request. |
| created_at | datetime | Date and time when the pull request was created. |
| updated_at | datetime | Date and time when the pull request was last updated. |
| closed_at | datetime | Date and time when the pull request was closed (optional, null if open). |
| merged_at | datetime | Date and time when the pull request was merged (optional, null if not merged). |
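A hedged sketch of how the calculated columns described above could be derived with pandas (file names are assumptions; the logic follows the definitions in the tables):

```python
import pandas as pd

repos = pd.read_csv("repositories.csv", parse_dates=["created_at", "updated_at"])
issues = pd.read_csv("issues.csv", parse_dates=["created_at", "closed_at"])

now = pd.Timestamp.now()
# "stale": no update in over 6 months (~182 days), per the table above.
repos["stale"] = (now - repos["updated_at"]).dt.days > 182

# resolution_time_days: days to close, -1 for unresolved, per the table above.
days = (issues["closed_at"] - issues["created_at"]).dt.total_seconds() / 86400
issues["resolution_time_days"] = days.fillna(-1)
```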
The Dataset comes from the Programming Languages Database (PLDB).
languages.csv: the full data dictionary is available from PLDB.com.
| variable | class | description |
|---|---|---|
| pldb_id | character | A standardized, uniquified version of the language name, used as an ID on the PLDB site. |
| title | character | The official title of the language. |
| description | character | Description of the repo on GitHub. |
| type | character | Which category in PLDB's subjective ontology does this entity fit into? |
| appeared | double | What year was the language publicly released and/or announced? |
| creators | character | Name(s) of the original creators of the language delimited by " and " |
| website | character | URL of the official homepage for the language project. |
| domain_name | character | If the project website is on its own domain. |
| domain_name_registered | double | When was this domain first registered? |
| reference | character | A link to more info about this entity. |
| isbndb | double | Books about this language from ISBNdb. |
| book_count | double | Computed; the number of books found for this language at isbndb.com |
| semantic_scholar | integer | Papers about this language from Semantic Scholar. |
| language_rank | double | Computed; A rank for the language, taking into account various online rankings. The computation for this column is not currently clear. |
| github_repo | character | URL of the official GitHub repo for the project if it hosted there. |
| github_repo_stars | double | How many stars does the repo have? |
| github_repo_forks | double | How many forks does the repo have? |
| github_repo_updated | double | What year was the last commit made? |
| github_repo_subscribers | double | How many subscribers to the repo? |
| github_repo_created | double | When was the Github repo for this entity created? |
| github_repo_description | character | Description of the repo on GitHub. |
| github_repo_issues | double | How many issues are on the repo? |
| github_repo_first_commit | double | What year was the first commit made in this git repo? |
| github_language | character | GitHub has a set of supported languages as defined here |
| github_language_tm_scope | character | The TextMate scope that represents this programming language. |
| github_language_type | character | Either data, programming, markup, prose, or nil. |
| github_language_ace_mode | character | A String name of the Ace Mode used for highlighting whenever a file is edited. This must match one of the filenames in http://git.io/3XO_Cg. Use "text" if a mode does not exist. |
| github_language_file_extensions | character | An Array of associated extensions (the first one is considered the primary extension, the others should be listed alphabetically). |
| github_language_repos | double | How many repos for this language does GitHub report? |
| wikipedia | character | URL of the entity on Wikipedia, if and only if it has a page dedicated to it. |
| wikipedia_daily_page_views | double | How many page views per day does this Wikipedia page get? Useful as a signal for rankings. Available via WP api. |
| wikipedia_backlinks_count | double | How many pages on WP link to this page? |
| wikipedia_summary | character | What is the text summary of the language from the Wikipedia page? |
| wikipedia_page_id | double | What is the internal ID for this entity on WP? |
| wikipedia_appeared | double | When does Wikipedia claim this entity first appeared? |
| wikipedia_created | double | When was the Wikipedia page for this entity created? |
| wikipedia_revision_count | double | How many revisions does this page have? |
| wikipedia_related | character | What languages does Wikipedia have as related? |
| features_has_comments | logical | Does this language have a comment character? |
| features_has_semantic_indentation | logical | Does indentation have semantic meaning in this language? |
| features_has_line_comments | logical | Does this language support inline comments (as opposed to comments that must span an entire line)? |
| line_comment_token | character | ... |
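As a quick-start sketch (column names as in the dictionary above; treating a lower language_rank as a better rank is an assumption):

```python
# Peek at languages.csv and the top-ranked languages.
import pandas as pd

langs = pd.read_csv("languages.csv")
top = langs.sort_values("language_rank").head(10)
print(top[["title", "appeared", "github_repo_stars"]])
```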
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SAC Ant Results: OpenAI Spinning Up (Modernized PPO/SAC Baseline)
Repository: github.com/monigarr/spinningup/tree/monigarr-dev
This dataset contains structured results from reinforcement learning (RL) experiments using the Soft Actor-Critic (SAC) algorithm on the Ant-v5 environment. It is part of MoniGarr's initiative to modernize and extend the original OpenAI Spinning Up codebase for current Python, PyTorch, and Gymnasium ecosystems.
The results include detailed logs of reward progression, hyperparameter configurations, evaluation summaries, and visualizations, all generated through reproducible experimentation using an updated and extensible RL workflow.
DATASET CONTENTS
- sac_ant_results.csv: Epoch-level log of training rewards, timesteps, and key metrics
- sac_config.json: Full configuration used for the SAC training run
- sac_eval_metrics.json: Summary of evaluation metrics including reward and return
- sac_training_plot.png: Reward curve visualization (training performance over time)
- experiment_notes.md: Key observations and tuning notes from the experiments
METHODOLOGY
- Cloned and refactored OpenAI’s Spinning Up repo
- Replaced deprecated gym with gymnasium
- Updated SAC implementation for compatibility with PyTorch 2.x
- Ran long-horizon training on Ant-v5 with multiple seeds and checkpoints
- Used custom logging for exportable CSV/JSON format results
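A minimal sketch of the modernized environment loop, where a random policy stands in for the trained SAC agent (assumes gymnasium with the MuJoCo v5 environments installed):

```python
# Gymnasium's step() returns (obs, reward, terminated, truncated, info),
# unlike legacy gym's 4-tuple.
import gymnasium as gym

env = gym.make("Ant-v5")
obs, info = env.reset(seed=0)
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # stand-in for the SAC policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"episode return: {total_reward:.1f}")
```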
INTENDED USES
This dataset supports:
* Baseline reproduction and RL benchmarking
* Curriculum development in deep reinforcement learning
* Comparative analysis of SAC vs. PPO/TD3
* Applied research, debugging, and educational tutorials
WHY THIS DATASET IS USEFUL
Maintaining parity with evolving RL tools is essential for ensuring reproducibility and learning efficiency. This dataset:
* Demonstrates SAC performance under modern configurations
* Offers ready-to-use logs and plots for analysis and reporting
* Enables faster experimentation for RL students and developers
PROJECT CONTEXT
This work is part of MoniGarr's larger suite of open-source AI efforts focused on:
* Modernizing legacy ML frameworks
* Promoting accessible, well-documented reinforcement learning pipelines
* Supporting low-resource developers and researchers with reproducible tools
GEOSPATIAL COVERAGE
- Primary Location: Akwesasne NY, Akwesasne Ontario
- Extended Context: Worldwide (open-source reproducibility)
- The dataset was generated in Akwesasne, but it's intended for worldwide use in reproducible RL research and education. Since the data is synthetic and code-driven, there's no human-subject or location-bound data involved.
ASSOCIATED PAPERS & SOURCES This dataset builds upon and modernizes results from:
SPINNING UP IN DEEP RL: OpenAI
GitHub: https://github.com/openai/spinningup
Paper: https://spinningup.openai.com/en/latest/spinningup.pdf
SOFT ACTOR-CRITIC ALGORITHMS: Haarnoja et al., 2018
Paper: https://arxiv.org/abs/1801.01290
SAC Code Reference: https://github.com/denisyarats/pytorch_sac
EXPECTED UPDATE FREQUENCY
- Initial Release: Complete
- Updates: Occasional; only if benchmark improvements, environment changes, or additional baseline comparisons (e.g., TD3, PPO-Penalty) are added.
- Community Contributions: Welcome via GitHub PRs and issues.
RECOMMENDED COVERAGE
- Reinforcement Learning education and experimentation
- Benchmarking reproducible SAC performance on Ant-v5
- Use in papers, blogs, notebooks, or reproducibility studies
- Modern RL code comparisons (Gym → Gymnasium, legacy → PyTorch 2.x)
If you find the dataset helpful, feel free to ⭐️ the repo or connect with @MoniGarr. https://github.com/monigarr/spinningup/tree/monigarr-dev
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a sample of approximately 5% of the full GitHub Code Snippets dataset.
The full dataset is over 60GB in size, and can be difficult to work with in Kaggle notebooks. We have released this development dataset for prototyping purposes so that you can test different ideas on a smaller dataset before running processing on the full dataset.
This dataset is built and maintained by Bugout.dev. To report an issue with the data or to request changes in future versions of the dataset, please open a discussion thread under the full dataset.
[ GitHub User Analysis 2019 for Graph Dataset ]
This is the GitHub User Analysis 2019 for Graph Dataset: a large social network of GitHub developers collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories and edges are mutual follower relationships between them. The vertex features are extracted based on location, starred repositories, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether a GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.
Data Description:
[Image: data description table]
GitHub User Analysis 2019 for Graph Dataset Tasks:
1. Can you predict whether a GitHub user in 2019 is a software engineer or an AI engineer based on their profile analysis and posting tendencies?
2. Can you predict whether a GitHub user in 2019 would follow an AI researcher based on their profile analysis and posting tendencies?
3. Can you predict whether a GitHub user in 2019 would make good publications based on their profile analysis and posting tendencies?
Try to visualize the GitHub User 2019 analysis and tendencies and try to find patterns in them.
License: GNU General Public License v2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset is a collection of 1052 GitHub repositories, along with columns such as the primary language used, fork count, open pull requests, and issue count.
While working on a repository recommendation project, I curated this data by scraping around 18,000+ repositories and filtering for those that have at least one open issue, so that we can recommend a repository the user can contribute to.
Columns
repositories - the name of the repository (Format - github_username/repository_name)
stars_count - stars count of the repository
forks_count - fork count of the repository
issues_count - active/opened issues in the repository
pull_requests - pull requests opened in the repository
contributors - number of contributors to the project so far
language - primary language used in the project
I found JSON data on Kaggle (link) and wrote a preprocessing function to convert it into a CSV file; a sketch of such a conversion is given below.
This is a comparatively bigger dataset, with data on 2,917,951 repositories.
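A hedged sketch of such a JSON-to-CSV conversion (file and field names are assumptions based on the columns listed below):

```python
# Flatten a JSON array of repository records into a CSV with pandas.
import json
import pandas as pd

with open("repositories.json", encoding="utf-8") as f:
    records = json.load(f)  # a JSON array of repository objects

df = pd.json_normalize(records)  # flattens nested fields into columns
df.to_csv("repositories_data.csv", index=False)
print(df.shape)
```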
Columns
name - the name of the repository
stars_count - stars count of the repository
forks_count - forks count of the repository
watchers - watchers in the repository
pull_requests - pull requests made in the repository
primary_language - the primary language of the repository
languages_used - list of all the languages used in the repository
commit_count - commits made in the repository
created_at - time and date when the repository was created
license - license assigned to the repository.
Note: the data in the dataset is from the time when it was scraped, so any updates to the actual repository will not be reflected here.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the rise of AI-assisted coding tools like GitHub Copilot, ChatGPT, and Codeium, the debate between AI-generated vs. human-written code has gained momentum. This dataset provides a structured comparison of 5,000 code snippets across multiple programming languages and problem domains.