GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and to learn how to safely analyze large BigQuery datasets.
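For illustration, a minimal sketch of such a query follows; it assumes the google-cloud-bigquery package is installed and Google Cloud credentials are configured, and the commit-count query itself is illustrative rather than part of the dataset documentation:

```python
# Sketch: top repositories by commit count in the public GitHub dataset.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT repo, COUNT(*) AS n_commits
    FROM `bigquery-public-data.github_repos.commits`,
         UNNEST(repo_name) AS repo
    GROUP BY repo
    ORDER BY n_commits DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.repo, row.n_commits)
```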
This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Abhishek Ranjan
Released under Database: Open Database License (ODbL); Contents: Database Contents License (DbCL)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A common question for those new to, and familiar with, computer science and software engineering is: what is the best and/or most popular programming language? It is very difficult to give a definitive answer, as there is a seemingly endless number of metrics that could define the 'best' or 'most popular' programming language.
One such metric is the number of projects and files that are made using that programming language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages used for repositories, PRs, and issues on GitHub can be a good indicator of a language's popularity.
This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.
This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.
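As a rough illustration (not the exact queries behind this dataset), an aggregation over the public github_repos tables might look like the following; table and column names follow BigQuery's published schema, and credentials are assumed to be configured:

```python
# Illustrative sketch: count repositories per language using the public
# github_repos.languages table (repo_name plus repeated language records).
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT lang.name AS language, COUNT(DISTINCT l.repo_name) AS repo_count
    FROM `bigquery-public-data.github_repos.languages` AS l,
         UNNEST(l.language) AS lang
    GROUP BY language
    ORDER BY repo_count DESC
    LIMIT 20
"""
for row in client.query(query).result():
    print(f"{row.language}: {row.repo_count}")
```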
Only data for public GitHub repositories, and their corresponding PRs/issues, have their data available publicly. Thus, this dataset is only based on public repositories, which may not be fully representative of all repositories on GitHub.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is obtained from the Github API and contains only public repository-level metadata. It may be useful for anyone interested in studying the Github ecosystem. It contains approximately 3.1 million entries.
The Github API Terms of Service apply.
You may not use this dataset for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.
Please see the sample exploration notebook for some examples of what you can do! The data format is a JSON array of entries, an example of which is given below.
{
"owner": "pelmers",
"name": "text-rewriter",
"stars": 13,
"forks": 5,
"watchers": 4,
"isFork": false,
"isArchived": false,
"languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
"languageCount": 3,
"topics": [ { "name": "chrome-extension", "stars": 43211 } ],
"topicCount": 1,
"diskUsageKb": 75,
"pullRequests": 4,
"issues": 12,
"description": "Webextension to rewrite phrases in pages",
"primaryLanguage": "JavaScript",
"createdAt": "2015-03-14T22:35:11Z",
"pushedAt": "2022-02-11T14:26:00Z",
"defaultBranchCommitCount": 54,
"license": null,
"assignableUserCount": 1,
"codeOfConduct": null,
"forkingAllowed": true,
"nameWithOwner": "pelmers/text-rewriter",
"parent": null
}
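As a minimal sketch of working with this format (the file name is an assumption; use the path of your download):

```python
# Load the JSON array of repository entries and compute a few aggregates.
import json
from collections import Counter

with open("repo_metadata.json", encoding="utf-8") as f:
    repos = json.load(f)  # a JSON array of entry objects, as shown above

print(len(repos), "repositories")
top_langs = Counter(
    r.get("primaryLanguage") for r in repos if r.get("primaryLanguage")
)
print(top_langs.most_common(5))
```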
The collection script and exploration notebook are also available on Github: https://github.com/pelmers/github-repository-metadata. For more background info, you can read my blog post.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Here is a description of how the datasets for a training notebook used in a Telegram ML Contest solution were prepared.
The first part of the code samples was taken from a private version of this notebook.
Here are the statistics for the classes of programming languages from the GitHub Code Snippets database:
[Chart: class distribution of programming languages in the GitHub Code Snippets database]
From this database, two CSV files were created, with 50,000 code samples for each of the 20 included programming languages: one with equal per-class counts and one with stratified sampling. The files here are sample_equal_prop_50000.csv and sample_stratified_50000.csv, respectively.
A second option for capturing additional examples was to run this notebook with a larger number of queries, 10,000.
The resulting file is dataset-10000.csv, included in the data card.
The statistics for its programming languages are shown in the next chart; there are 32 labeled classes.
[Chart: class distribution across the 32 labeled classes in dataset-10000.csv]
To make the model more robust, code samples for 20 additional languages were collected, roughly 10 to 15 samples each, covering more or less popular use cases. Also, for the class "OTHER" (natural-language examples, per the task of the competition), text examples from this dataset of prompts on Hugging Face were added to the file. The resulting file here is rare_languages.csv, also in the data card.
The statistics for the rare-language code snippets are as follows:
[Chart: class distribution of the rare-language code snippets]
For this stage of dataset creation, the columns in sample_equal_prop_50000.csv and sample_stratified_50000.csv were cut down to just two, "snippet" and "language"; the equal-counts version of the file is in the data card as sample_equal_prop_50000_clean.csv.
To prepare the BigQuery dataset file, the index column was dropped and the column "content" was renamed to "snippet". These changes were saved in dataset-10000-clean.csv.
After that, the files sample_equal_prop_50000_clean.csv and dataset-10000-clean.csv were combined and saved as github-combined-file.csv.
The prepared files took too much RAM to be read with the pandas library, so additional preprocessing was done: characters such as quotes, commas, ampersands, newlines, and tabs were cleaned out. After cleaning, the files were merged with the rare_languages.csv file and saved as github-combined-file-no-symbols-rare-clean.csv and sample_equal_prop_50000_-no-symbols-rare-clean.csv, respectively.
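A hedged pandas sketch combining the column cleanup and symbol stripping described above (the exact character set removed is an assumption):

```python
import pandas as pd

# Drop the index column and rename "content" to "snippet".
df = pd.read_csv("dataset-10000.csv", index_col=0)
df = df.rename(columns={"content": "snippet"})

# Strip characters that made the merged files too heavy to read:
# quotes, commas, ampersands, newlines, and tabs.
df["snippet"] = (
    df["snippet"]
    .astype(str)
    .str.replace(r"[\"',&]", " ", regex=True)
    .str.replace(r"[\n\t]+", " ", regex=True)
)
df.to_csv("dataset-10000-clean.csv", index=False)
```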
The final distribution of classes turned out as follows:
[Chart: final class distribution of the combined dataset]
To be suitable for the TF-DF format, each programming language was also assigned a numeric label. The final labels are in the data card.
This folder contains the baseline model implementation for the Kaggle universal image embedding challenge based on
Following the above ideas, we also add a 64-dimensional projection layer on top of the Vision Transformer base model as the final embedding, since the competition requires embeddings of at most 64 dimensions. Please find more details in image_classification.py.
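As a rough sketch of that idea (names are illustrative, the actual implementation is in image_classification.py, and the L2 normalization here is an assumption):

```python
# Sketch: a 64-d projection head on a ViT backbone, since the competition
# caps embeddings at 64 dimensions.
import tensorflow as tf

def build_embedding_model(vit_backbone: tf.keras.Model) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = vit_backbone(inputs)                  # pooled ViT features
    embedding = tf.keras.layers.Dense(64)(features)  # 64-d projection
    embedding = tf.keras.layers.Lambda(
        lambda x: tf.math.l2_normalize(x, axis=-1)   # unit-norm embedding
    )(embedding)
    return tf.keras.Model(inputs, embedding)
```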
To use the code, please first install the prerequisites:
pip install -r universal_embedding_challenge/requirements.txt
git clone https://github.com/tensorflow/models.git /tmp/models
export PYTHONPATH=$PYTHONPATH:/tmp/models
pip install --user -r /tmp/models/official/requirements.txt
Secondly, please download the imagenet1k data in TFRecord format from https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-0 and https://www.kaggle.com/datasets/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-1, and merge them together under folder imagenet-2012-tfrecord/. As a result, the paths to the training datasets and the validation datasets should be imagenet-2012-tfrecord/train* and imagenet-2012-tfrecord/validation*, respectively.
The trainer for the model is implemented in train.py, and the following example launches the training
python -m universal_embedding_challenge.train \
--experiment=vit_with_bottleneck_imagenet_pretrain \
--mode=train_and_eval \
--model_dir=/tmp/imagenet1k_test
The trained model checkpoints can be further converted to SavedModel format using export_saved_model.py for Kaggle submission.
The code to compute metrics for Universal Embedding Challenge is implemented in metrics.py and the code to read the solution file is implemented in read_retrieval_solution.py.
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
A dataset containing the list of top organizations on GitHub and their contributors.
Primarily intended for social network analysis. Check out the starter notebook for reference.
This dataset was created by Rajesh Kumar Pandey
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Learning Path Index Dataset is a comprehensive collection of byte-sized courses and learning materials tailored for individuals eager to delve into the fields of Data Science, Machine Learning, and Artificial Intelligence (AI), making it an indispensable reference for students, professionals, and educators in the Data Science and AI communities.
This Kaggle Dataset, along with the KaggleX Learning Path Index GitHub Repo, was created by the mentors and mentees of Cohort 3 of the KaggleX BIPOC Mentorship Program (between August 2023 and November 2023; also see this). See the Credits section at the bottom of the long description.
This dataset was created out of a commitment to facilitate learning and growth within the Data Science, Machine Learning, and AI communities. It started off as an idea during the brainstorming and feedback session at the end of Cohort 2 of the KaggleX BIPOC Mentorship Program: to create byte-sized learning material to help our KaggleX mentees learn things faster. It aspires to simplify the process of finding, evaluating, and selecting the most fitting educational resources.
This dataset was meticulously curated to assist learners in navigating the vast landscape of Data Science, Machine Learning, and AI education. It serves as a compass for those aiming to develop their skills and expertise in these rapidly evolving fields.
The mentors and mentees communicated via Discord, Trello, Google Hangouts, and other channels to put together these artifacts, and made them public for everyone to use and contribute back to.
The dataset compiles data from a curated selection of reputable sources including leading educational platforms such as Google Developer, Google Cloud Skill Boost, IBM, Fast AI, etc. By drawing from these trusted sources, we ensure that the data is both accurate and pertinent. The raw data and other artifacts as a result of this exercise can be found on the GitHub Repo i.e. KaggleX Learning Path Index GitHub Repo.
The dataset encompasses the following attributes:
The Learning Path Index Dataset is openly shared under a permissive license, allowing users to utilize the data for educational, analytical, and research purposes within the Data Science, Machine Learning, and AI domains. Feel free to fork the dataset and make it your own, we would be delighted if you contributed back to the dataset and/or our KaggleX Learning Path Index GitHub Repo as well.
Credits for all the work done to create this Kaggle Dataset and the KaggleX Learning Path Index GitHub Repo…
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides insights into the Indian developer community on GitHub, one of the world’s largest platforms for developers to collaborate, share, and contribute to open-source projects. Whether you're interested in analyzing trends, understanding community growth, or identifying popular programming languages, this dataset offers a comprehensive look at the profiles of GitHub users from India.
The dataset includes anonymized profile information for a diverse range of GitHub users based in India. Key features include:
- Username: Unique identifier for each user (anonymized)
- Location: City or region within India
- Programming Languages: Most commonly used languages per user
- Repositories: Public repositories owned and contributed to
- Followers and Following: Social network connections within the platform
- GitHub Join Date: Date the user joined GitHub
- Organizations: Affiliated organizations (if publicly available)
This dataset is curated from publicly available GitHub profiles with a specific focus on Indian users. It is inspired by the need to understand the growth of the tech ecosystem in India, including the languages, tools, and topics that are currently popular among Indian developers. This dataset aims to provide valuable insights for recruiters, data scientists, and anyone interested in the open-source contributions of Indian developers.
This dataset is perfect for:
- Data scientists looking to explore and visualize developer trends
- Recruiters interested in talent scouting within the Indian tech ecosystem
- Tech enthusiasts who want to explore the dynamics of India's open-source community
- Students and educators looking for real-world data to practice analysis and modeling
This dataset is from the GitHub repo below. You can use a model trained with the Dogs vs. Cats dataset on Kaggle. GitHub: https://github.com/amitrajitbose/cat-v-dog-classifier-pytorch Kaggle Competition: https://www.kaggle.com/c/dogs-vs-cats/data
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset captures the metadata of 14,000+ repositories across GitHub. You’ll find everything from stars and forks to health scores, README previews, and language breakdowns.
It’s ideal for:
- Identifying repo trends over time
- Comparing popular vs. low-engagement projects
- Exploring what makes a repo “healthy”
Perfect for learning data cleaning, analysis, and visualization using real-world project metadata.
Description
An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on location, starred repositories, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether a GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.
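A hedged sketch of loading the graph (the file names follow this dataset's common distribution and are assumptions here):

```python
# Load the developer graph and inspect the two classes.
import pandas as pd
import networkx as nx

edges = pd.read_csv("musae_git_edges.csv")    # assumed columns: id_1, id_2
target = pd.read_csv("musae_git_target.csv")  # assumed columns: id, name, ml_target

G = nx.from_pandas_edgelist(edges, "id_1", "id_2")
print(G.number_of_nodes(), "developers,", G.number_of_edges(), "mutual follows")
print(target["ml_target"].value_counts())     # 1 = ML developer, 0 = web
```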
Properties
Possible Tasks
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset comprises detailed information about GitHub repositories, issues, and pull requests, collected using the GitHub API. The data includes repository metadata (such as stars, forks, and open issues), along with historical data on issues and pull requests (PRs), including their creation, closure, and merging timelines.
This dataset contains information about GitHub repositories, including metadata such as stars, forks, and activity status.
| Column Name | Data Type | Description |
|---|---|---|
| id | object | Unique identifier for the repository. |
| name | object | Name of the repository (e.g., "docker"). |
| full_name | object | Full name of the repository (e.g., "prometheus/alertmanager"). |
| description | object | Description of the repository, may be empty. |
| stars | int64 | Number of stars the repository has. |
| forks | int64 | Number of times the repository has been forked. |
| open_issues | int64 | Number of open issues in the repository. |
| created_at | datetime | Date and time when the repository was created. |
| updated_at | datetime | Date and time when the repository was last updated. |
| size_category | object | Categorization of the repository based on the number of stars (micro, small, medium, large, mega). |
| stale | bool | Boolean flag indicating if the repository is "stale" (hasn't been updated in over 6 months). |
| stars_per_fork | float64 | Number of stars per fork (calculated). |
| stars_per_issue | float64 | Number of stars per open issue (calculated). |
| contributor_per_star | float64 | Number of contributors per star (calculated). |
| total_contributors | int64 | Total number of contributors from issues and pull requests. |
This dataset contains details of issues raised in the repositories, including information about their creation, closing, and state.
| Column Name | Data Type | Description |
|---|---|---|
| id | object | Unique identifier for the issue. |
| created_at | datetime | Date and time when the issue was created. |
| updated_at | datetime | Date and time when the issue was last updated. |
| closed_at | datetime | Date and time when the issue was closed (optional, null if open). |
| number | int64 | Issue number in the GitHub repository. |
| repository | object | The repository that the issue belongs to (name). |
| state | object | Current state of the issue (either "open" or "closed"). |
| title | object | Title of the issue. |
| resolution_time_days | float64 | Number of days taken to resolve the issue (calculated, -1 for unresolved issues). |
This dataset contains information about pull requests (PRs) in the repositories, including metadata such as their state, creation, closing, and merging time.
| Column Name | Data Type | Description |
|---|---|---|
| id | object | Unique identifier for the pull request. |
| created_at | datetime | Date and time when the pull request was created. |
| updated_at | datetime | Date and time when the pull request was last updated. |
| closed_at | datetime | Date and time when the pull request was closed (optional, null if open). |
| merged_at | datetime | Date and time when the pull request was merged (optional, null if not merged). |
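A hedged sketch of how the calculated columns described above could be derived with pandas (file names are assumptions; the logic follows the definitions in the tables):

```python
import pandas as pd

repos = pd.read_csv("repositories.csv", parse_dates=["created_at", "updated_at"])
issues = pd.read_csv("issues.csv", parse_dates=["created_at", "closed_at"])

now = pd.Timestamp.now()
# "stale": no update in over 6 months (~182 days), per the table above.
repos["stale"] = (now - repos["updated_at"]).dt.days > 182

# resolution_time_days: days to close, -1 for unresolved, per the table above.
days = (issues["closed_at"] - issues["created_at"]).dt.total_seconds() / 86400
issues["resolution_time_days"] = days.fillna(-1)
```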
The Dataset comes from the Programming Languages Database (PLDB).
languages.csv: the full data dictionary is available from PLDB.com.
| variable | class | description |
|---|---|---|
| pldb_id | character | A standardized, uniquified version of the language name, used as an ID on the PLDB site. |
| title | character | The official title of the language. |
| description | character | Description of the repo on GitHub. |
| type | character | Which category in PLDB's subjective ontology does this entity fit into? |
| appeared | double | What year was the language publicly released and/or announced? |
| creators | character | Name(s) of the original creators of the language delimited by " and " |
| website | character | URL of the official homepage for the language project. |
| domain_name | character | If the project website is on its own domain. |
| domain_name_registered | double | When was this domain first registered? |
| reference | character | A link to more info about this entity. |
| isbndb | double | Books about this language from ISBNdb. |
| book_count | double | Computed; the number of books found for this language at isbndb.com |
| semantic_scholar | integer | Papers about this language from Semantic Scholar. |
| language_rank | double | Computed; A rank for the language, taking into account various online rankings. The computation for this column is not currently clear. |
| github_repo | character | URL of the official GitHub repo for the project if it hosted there. |
| github_repo_stars | double | How many stars does the repo have? |
| github_repo_forks | double | How many forks does the repo have? |
| github_repo_updated | double | What year was the last commit made? |
| github_repo_subscribers | double | How many subscribers to the repo? |
| github_repo_created | double | When was the Github repo for this entity created? |
| github_repo_description | character | Description of the repo on GitHub. |
| github_repo_issues | double | How many issues are on the repo? |
| github_repo_first_commit | double | What year was the first commit made in this git repo? |
| github_language | character | GitHub has a set of supported languages as defined here |
| github_language_tm_scope | character | The TextMate scope that represents this programming language. |
| github_language_type | character | Either data, programming, markup, prose, or nil. |
| github_language_ace_mode | character | A String name of the Ace Mode used for highlighting whenever a file is edited. This must match one of the filenames in http://git.io/3XO_Cg. Use "text" if a mode does not exist. |
| github_language_file_extensions | character | An Array of associated extensions (the first one is considered the primary extension, the others should be listed alphabetically). |
| github_language_repos | double | How many repos for this language does GitHub report? |
| wikipedia | character | URL of the entity on Wikipedia, if and only if it has a page dedicated to it. |
| wikipedia_daily_page_views | double | How many page views per day does this Wikipedia page get? Useful as a signal for rankings. Available via WP api. |
| wikipedia_backlinks_count | double | How many pages on WP link to this page? |
| wikipedia_summary | character | What is the text summary of the language from the Wikipedia page? |
| wikipedia_page_id | double | What is the internal ID for this entity on WP? |
| wikipedia_appeared | double | When does Wikipedia claim this entity first appeared? |
| wikipedia_created | double | When was the Wikipedia page for this entity created? |
| wikipedia_revision_count | double | How many revisions does this page have? |
| wikipedia_related | character | What languages does Wikipedia have as related? |
| features_has_comments | logical | Does this language have a comment character? |
| features_has_semantic_indentation | logical | Does indentation have semantic meaning in this language? |
| features_has_line_comments | logical | Does this language support inline comments (as opposed to comments that must span an entire line)? |
| line_comment_token | character | ... |
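As a quick-start sketch (column names as in the dictionary above; treating a lower language_rank as a better rank is an assumption):

```python
# Peek at languages.csv and the top-ranked languages.
import pandas as pd

langs = pd.read_csv("languages.csv")
top = langs.sort_values("language_rank").head(10)
print(top[["title", "appeared", "github_repo_stars"]])
```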
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SAC Ant Results: OpenAI Spinning Up (Modernized PPO/SAC Baseline)
Repository: github.com/monigarr/spinningup/tree/monigarr-dev
This dataset contains structured results from reinforcement learning (RL) experiments using the Soft Actor-Critic (SAC) algorithm on the Ant-v5 environment. It is part of MoniGarr's initiative to modernize and extend the original OpenAI Spinning Up codebase for current Python, PyTorch, and Gymnasium ecosystems.
The results include detailed logs of reward progression, hyperparameter configurations, evaluation summaries, and visualizations, all generated through reproducible experimentation using an updated and extensible RL workflow.
DATASET CONTENTS
- sac_ant_results.csv: Epoch-level log of training rewards, timesteps, and key metrics
- sac_config.json: Full configuration used for the SAC training run
- sac_eval_metrics.json: Summary of evaluation metrics including reward and return
- sac_training_plot.png: Reward curve visualization (training performance over time)
- experiment_notes.md: Key observations and tuning notes from the experiments
METHODOLOGY
- Cloned and refactored OpenAI’s Spinning Up repo
- Replaced deprecated gym with gymnasium
- Updated SAC implementation for compatibility with PyTorch 2.x
- Ran long-horizon training on Ant-v5 with multiple seeds and checkpoints
- Used custom logging for exportable CSV/JSON format results
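A minimal sketch of the modernized environment loop, where a random policy stands in for the trained SAC agent (assumes gymnasium with the MuJoCo v5 environments installed):

```python
# Gymnasium's step() returns (obs, reward, terminated, truncated, info),
# unlike legacy gym's 4-tuple.
import gymnasium as gym

env = gym.make("Ant-v5")
obs, info = env.reset(seed=0)
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # stand-in for the SAC policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"episode return: {total_reward:.1f}")
```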
INTENDED USES
This dataset supports:
* Baseline reproduction and RL benchmarking
* Curriculum development in deep reinforcement learning
* Comparative analysis of SAC vs. PPO/TD3
* Applied research, debugging, and educational tutorials
WHY THIS DATASET IS USEFUL
Maintaining parity with evolving RL tools is essential for ensuring reproducibility and learning efficiency. This dataset:
* Demonstrates SAC performance under modern configurations
* Offers ready-to-use logs and plots for analysis and reporting
* Enables faster experimentation for RL students and developers
PROJECT CONTEXT
This work is part of MoniGarr's larger suite of open-source AI efforts focused on:
* Modernizing legacy ML frameworks
* Promoting accessible, well-documented reinforcement learning pipelines
* Supporting low-resource developers and researchers with reproducible tools
GEOSPATIAL COVERAGE
- Primary Location: Akwesasne NY, Akwesasne Ontario
- Extended Context: Worldwide (open-source reproducibility)
- The dataset was generated in Akwesasne, but it's intended for worldwide use in reproducible RL research and education. Since the data is synthetic and code-driven, there's no human-subject or location-bound data involved.
ASSOCIATED PAPERS & SOURCES This dataset builds upon and modernizes results from:
SPINNING UP IN DEEP RL: OpenAI
GitHub: https://github.com/openai/spinningup
Paper: https://spinningup.openai.com/en/latest/spinningup.pdf
SOFT ACTOR-CRITIC ALGORITHMS: Haarnoja et al., 2018
Paper: https://arxiv.org/abs/1801.01290
SAC Code Reference: https://github.com/denisyarats/pytorch_sac
EXPECTED UPDATE FREQUENCY
- Initial Release: Complete
- Updates: Occasional; only if benchmark improvements, environment changes, or additional baseline comparisons (e.g., TD3, PPO-Penalty) are added.
- Community Contributions: Welcome via GitHub PRs and issues.
RECOMMENDED COVERAGE
- Reinforcement Learning education and experimentation
- Benchmarking reproducible SAC performance on Ant-v5
- Use in papers, blogs, notebooks, or reproducibility studies
- Modern RL code comparisons (Gym → Gymnasium, legacy → PyTorch 2.x)
If you find the dataset helpful, feel free to ⭐️ the repo or connect with @MoniGarr. https://github.com/monigarr/spinningup/tree/monigarr-dev
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a sample of approximately 5% of the full GitHub Code Snippets dataset.
The full dataset is over 60GB in size, and can be difficult to work with in Kaggle notebooks. We have released this development dataset for prototyping purposes so that you can test different ideas on a smaller dataset before running processing on the full dataset.
This dataset is built and maintained by Bugout.dev. To report an issue with the data or to request changes in future versions of the dataset, please open a discussion thread under the full dataset.
[ GitHub User Analysis 2019 for Graph Dataset ]
This is the GitHub User Analysis 2019 for Graph Dataset: a large social network of GitHub developers collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories and edges are mutual follower relationships between them. The vertex features are extracted based on location, starred repositories, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether a GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.
Data Description:
[Image: data description table]
GitHub User Analysis 2019 for Graph Dataset Tasks:
1. Can you predict whether a GitHub user in 2019 is a software engineer or an AI engineer based on their profile analysis and posting tendencies?
2. Can you predict whether a GitHub user in 2019 would follow an AI researcher based on their profile analysis and posting tendencies?
3. Can you predict whether a GitHub user in 2019 would make good publications based on their profile analysis and posting tendencies?
Try to visualize the GitHub User 2019 analysis and tendencies and try to find patterns in them.
License: GNU General Public License v2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset is a collection of 1052 GitHub repositories, along with columns such as the primary language used, fork count, open pull requests, and issue count.
While working on a repository recommendation project, I curated this data by scraping around 18,000+ repositories and filtering for those that have at least one open issue, so that we can recommend a repository the user can contribute to.
Columns
repositories - the name of the repository (Format - github_username/repository_name)
stars_count - stars count of the repository
forks_count - fork count of the repository
issues_count - active/opened issues in the repository
pull_requests - pull requests opened in the repository
contributors - number of contributors to the project so far
language - primary language used in the project
I found JSON data on Kaggle (link) and wrote a preprocessing function to convert it into a CSV file; a sketch of such a conversion is given below.
This is a comparatively bigger dataset, with data on 2,917,951 repositories.
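A hedged sketch of such a JSON-to-CSV conversion (file and field names are assumptions based on the columns listed below):

```python
# Flatten a JSON array of repository records into a CSV with pandas.
import json
import pandas as pd

with open("repositories.json", encoding="utf-8") as f:
    records = json.load(f)  # a JSON array of repository objects

df = pd.json_normalize(records)  # flattens nested fields into columns
df.to_csv("repositories_data.csv", index=False)
print(df.shape)
```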
Columns
name - the name of the repository
stars_count - stars count of the repository
forks_count - forks count of the repository
watchers - watchers in the repository
pull_requests - pull requests made in the repository
primary_language - the primary language of the repository
languages_used - list of all the languages used in the repository
commit_count - commits made in the repository
created_at - time and date when the repository was created
license - license assigned to the repository.
Note: the data in the dataset is from the time when it was scraped, so any updates to the actual repository will not be reflected here.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the rise of AI-assisted coding tools like GitHub Copilot, ChatGPT, and Codeium, the debate between AI-generated vs. human-written code has gained momentum. This dataset provides a structured comparison of 5,000 code snippets across multiple programming languages and problem domains.