100+ datasets found

GitHub Public Repository Metadata
kaggle.com
zip
Updated Oct 26, 2025
Cite
Peter (2025). GitHub Public Repository Metadata [Dataset]. https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars
Explore at:
Available download formats: zip (606866859 bytes)
Dataset updated
Oct 26, 2025
Authors
Peter
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is obtained from the GitHub API and contains only public repository-level metadata (i.e. no code) for public repositories with 5+ stars on GitHub. It may be useful for anyone interested in studying the GitHub ecosystem. It contains approximately 3.1 million entries.

The GitHub API Terms of Service apply.

You may not use this dataset for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.

Please see the sample exploration notebook for some examples of what you can do! The data format is a JSON array of entries, an example of which is given below.

Example entry

{
 "owner": "pelmers",
 "name": "text-rewriter",
 "stars": 13,
 "forks": 5,
 "watchers": 4,
 "isFork": false,
 "isArchived": false,
 "languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
 "languageCount": 3,
 "topics": [ { "name": "chrome-extension", "stars": 43211 } ],
 "topicCount": 1,
 "diskUsageKb": 75,
 "pullRequests": 4,
 "issues": 12,
 "description": "Webextension to rewrite phrases in pages",
 "primaryLanguage": "JavaScript",
 "createdAt": "2015-03-14T22:35:11Z",
 "pushedAt": "2022-02-11T14:26:00Z",
 "defaultBranchCommitCount": 54,
 "license": null,
 "assignableUserCount": 1,
 "codeOfConduct": null,
 "forkingAllowed": true,
 "nameWithOwner": "pelmers/text-rewriter",
 "parent": null
}
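As a rough illustration of working with this format, the sketch below parses a JSON array holding a trimmed copy of the example entry and computes each language's share of the repository by byte size. The helper name `language_shares` is hypothetical, and the inline string merely stands in for the downloaded file, whose name is not given here.

```python
import json

# A one-element JSON array built from the example entry (fields trimmed).
# In practice this text would come from the downloaded dataset file.
SAMPLE = """
[
  {
    "nameWithOwner": "pelmers/text-rewriter",
    "stars": 13,
    "primaryLanguage": "JavaScript",
    "languages": [
      {"name": "JavaScript", "size": 21769},
      {"name": "HTML", "size": 2096},
      {"name": "CSS", "size": 2081}
    ]
  }
]
"""


def language_shares(entry):
    """Return {language name: fraction of total bytes} for one entry."""
    languages = entry.get("languages") or []
    total = sum(lang["size"] for lang in languages)
    if total == 0:
        return {}
    return {lang["name"]: lang["size"] / total for lang in languages}


entries = json.loads(SAMPLE)
shares = language_shares(entries[0])
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With roughly 3.1 million entries, loading the whole array at once may need a lot of memory, so a streaming JSON parser could be a better fit for the full file.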

The collection script and exploration notebook are also available on GitHub: https://github.com/pelmers/github-repository-metadata. For more background info, you can read my blog post.
GitHub Dataset
kaggle.com
zip
Updated Mar 2, 2023
Cite
Nikhil Raj (2023). GitHub Dataset [Dataset]. https://www.kaggle.com/nikhil25803/github-dataset
Explore at:
Available download formats: zip (79399228 bytes)
Dataset updated
Mar 2, 2023
Authors
Nikhil Raj
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Two versions of the dataset are available.

Version 1 Link

This dataset is a collection of 1052 GitHub repositories, along with other columns such as the primary language used in it, fork count, open pull requests, and issue count.

While working on a repository recommendation project, I curated this data by scraping more than 18,000 repositories and keeping those with at least one open issue, so that we can recommend repositories to which the user can contribute.

Columns
repositories - the name of the repository (format: github_username/repository_name)
stars_count - stars count of the repository
forks_count - fork count of the repository
issues_count - active/open issues in the repository
pull_requests - pull requests opened in the repository
contributors - contributors who have contributed to the project so far
language - primary language used in the project

Version 2 Link

I found a JSON dataset on Kaggle (link) and wrote a preprocessing function to convert it into a CSV file.

This is a considerably larger dataset, with data on 2,917,951 repositories.

Columns
name - the name of the repository
stars_count - stars count of the repository
forks_count - forks count of the repository
watchers - watchers in the repository
pull_requests - pull requests made in the repository
primary_language - the primary language of the repository
languages_used - list of all the languages used in the repository
commit_count - commits made in the repository
created_at - time and date when the repository was created
license - license assigned to the repository
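The preprocessing function itself is not shown in the description. A minimal sketch of such a conversion is given below, assuming the input entries use the field names of the GitHub Public Repository Metadata example entry (`nameWithOwner`, `stars`, `languages`, and so on). The mapping from those fields to the Version 2 columns and the function names `entry_to_row` and `entries_to_csv` are assumptions, not the author's code.

```python
import csv
import io
import json

COLUMNS = [
    "name", "stars_count", "forks_count", "watchers", "pull_requests",
    "primary_language", "languages_used", "commit_count", "created_at",
    "license",
]


def entry_to_row(entry):
    """Map one JSON entry to a flat dict keyed by the CSV columns (assumed mapping)."""
    return {
        "name": entry.get("nameWithOwner"),
        "stars_count": entry.get("stars"),
        "forks_count": entry.get("forks"),
        "watchers": entry.get("watchers"),
        "pull_requests": entry.get("pullRequests"),
        "primary_language": entry.get("primaryLanguage"),
        # Store the language list as a JSON string so it fits in one cell.
        "languages_used": json.dumps(
            [lang["name"] for lang in entry.get("languages") or []]
        ),
        "commit_count": entry.get("defaultBranchCommitCount"),
        "created_at": entry.get("createdAt"),
        "license": entry.get("license"),
    }


def entries_to_csv(entries, out):
    """Write the entries to the file-like object `out` as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for entry in entries:
        writer.writerow(entry_to_row(entry))


example = [{
    "nameWithOwner": "pelmers/text-rewriter", "stars": 13, "forks": 5,
    "watchers": 4, "pullRequests": 4, "primaryLanguage": "JavaScript",
    "languages": [{"name": "JavaScript", "size": 21769}],
    "defaultBranchCommitCount": 54, "createdAt": "2015-03-14T22:35:11Z",
    "license": None,
}]
buffer = io.StringIO()
entries_to_csv(example, buffer)
print(buffer.getvalue())
```

Missing values such as a `null` license end up as empty cells, which keeps the CSV rectangular.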

Note: The data reflects the repositories at the time they were scraped, so later updates to the actual repositories are not reflected here.
spinning-up-sac-ant-results
kaggle.com
zip
Updated Jul 20, 2025
Cite
MoniGarrr (2025). spinning-up-sac-ant-results [Dataset]. https://www.kaggle.com/datasets/monigarrr/spinning-up-sac-ant-results
Explore at:
Available download formats: zip (2209 bytes)
Dataset updated
Jul 20, 2025
Authors
MoniGarrr
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
SAC Ant Results: OpenAI Spinning Up (Modernized PPO/SAC Baseline)
Repository: github.com/monigarr/spinningup/tree/monigarr-dev

This dataset contains structured results from reinforcement learning (RL) experiments using the Soft Actor-Critic (SAC) algorithm on the Ant-v5 environment. It is part of MoniGarr's initiative to modernize and extend the original OpenAI Spinning Up codebase for current Python, PyTorch, and Gymnasium ecosystems.

The results include detailed logs of reward progression, hyperparameter configurations, evaluation summaries, and visualizations, all generated through reproducible experimentation using an updated and extensible RL workflow.

DATASET CONTENTS:
File Name - Description
sac_ant_results.csv - Epoch-level log of training rewards, timesteps, and key metrics
sac_config.json - Full configuration used for the SAC training run
sac_eval_metrics.json - Summary of evaluation metrics including reward and return
sac_training_plot.png - Reward curve visualization (training performance over time)
experiment_notes.md - Key observations and tuning notes from the experiments

METHODOLOGY
- Cloned and refactored OpenAI’s Spinning Up repo
- Replaced deprecated gym with gymnasium
- Updated SAC implementation for compatibility with PyTorch 2.x
- Ran long-horizon training on Ant-v5 with multiple seeds and checkpoints
- Used custom logging for exportable CSV/JSON format results
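One concrete consequence of replacing gym with gymnasium is the changed environment interface: `reset()` returns `(observation, info)` and `step()` returns five values, with the old `done` flag split into `terminated` and `truncated`. The sketch below shows a thin adapter around an old-style environment. `LegacyEnv` and `FiveTupleAdapter` are toy stand-ins written for this illustration; they are not part of either library and are not taken from the repository's code.

```python
class LegacyEnv:
    """Toy stand-in for an old gym-style env: step() returns a 4-tuple."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # observation only

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        # Old convention: time-limit endings are flagged inside `info`.
        info = {"TimeLimit.truncated": done}
        return float(self.t), 1.0, done, info


class FiveTupleAdapter:
    """Expose gymnasium-style reset()/step() signatures over a legacy env."""

    def __init__(self, env):
        self.env = env

    def reset(self, seed=None):
        # `seed` is accepted for signature compatibility; the toy env ignores it.
        return self.env.reset(), {}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = done and not truncated
        return obs, reward, terminated, truncated, info


env = FiveTupleAdapter(LegacyEnv(horizon=3))
obs, info = env.reset(seed=0)
episode_return = 0.0
while True:
    obs, reward, terminated, truncated, info = env.step(action=None)
    episode_return += reward
    if terminated or truncated:
        break
print(episode_return)
```

Keeping `terminated` and `truncated` apart matters for value-based methods such as SAC, because bootstrapping from the next state is usually still appropriate when an episode ends only due to a time limit.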

INTENDED USES
This dataset supports:
* Baseline reproduction and RL benchmarking
* Curriculum development in deep reinforcement learning
* Comparative analysis of SAC vs. PPO/TD3
* Applied research, debugging, and educational tutorials

WHY THIS DATASET IS USEFUL
Maintaining parity with evolving RL tools is essential for ensuring reproducibility and learning efficiency. This dataset:
* Demonstrates SAC performance under modern configurations
* Offers ready-to-use logs and plots for analysis and reporting
* Enables faster experimentation for RL students and developers

PROJECT CONTEXT
This work is part of MoniGarr's larger suite of open-source AI efforts focused on:
* Modernizing legacy ML frameworks
* Promoting accessible, well-documented reinforcement learning pipelines
* Supporting low-resource developers and researchers with reproducible tools

GEOSPATIAL COVERAGE
- Primary Location: Akwesasne NY, Akwesasne Ontario
- Extended Context: Worldwide (Open-source reproducibility)
- The dataset was generated in Akwesasne but it's intended for worldwide use in reproducible RL research and education. Since the data is synthetic and code-driven, there's no human subject or location-bound data involved.

ASSOCIATED PAPERS & SOURCES
This dataset builds upon and modernizes results from:

SPINNING UP IN DEEP RL: OpenAI
GitHub: https://github.com/openai/spinningup
Paper: https://spinningup.openai.com/en/latest/spinningup.pdf

SOFT ACTOR-CRITIC ALGORITHMS
Haarnoja et al., 2018
Paper: https://arxiv.org/abs/1801.01290
SAC Code Reference: https://github.com/denisyarats/pytorch_sac

EXPECTED UPDATE FREQUENCY
Initial Release: Complete
Updates: Occasional, only if benchmark improvements, environment changes, or additional baseline comparisons (e.g., TD3, PPO-Penalty) are added.
Community Contributions: Welcome via GitHub PRs and issues.

RECOMMENDED COVERAGE
- Reinforcement Learning education and experimentation
- Benchmarking reproducible SAC performance on Ant-v5
- Use in papers, blogs, notebooks, or reproducibility studies
- Modern RL code comparisons (Gym → Gymnasium, legacy → PyTorch 2.x)

If you find the dataset helpful, feel free to ⭐️ the repo or connect with @MoniGarr. https://github.com/monigarr/spinningup/tree/monigarr-dev
git-diff_to_commit_msg
huggingface.co
kaggle.com
Updated Oct 5, 2025
+ more versions
Cite
Epasinghe (2025). git-diff_to_commit_msg [Dataset]. https://huggingface.co/datasets/seniruk/git-diff_to_commit_msg
Explore at:
Dataset updated
Oct 5, 2025
Authors
Epasinghe
License
https://choosealicense.com/licenses/other/
Description
Hi, I’m Seniru Epasinghe 👋

I’m an AI undergraduate and an AI enthusiast, working on machine learning projects and open-source contributions. I enjoy exploring AI pipelines, natural language processing, and building tools that make development easier.

🌐 Connect with me

There are 2 versions of this dataset:

git-diff_to_commit_msg - 1.5K rows huggingface link kaggle link

git-diff_to_commit_msg_large - 1.75M rows huggingface link kaggle link… See the full description on the dataset page: https://huggingface.co/datasets/seniruk/git-diff_to_commit_msg.
Custom Yolov7 On Kaggle On Custom Dataset
universe.roboflow.com
zip
Updated Jan 29, 2023
Cite
Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/1
Explore at:
Available download formats: zip
Dataset updated
Jan 29, 2023
Dataset authored and provided by
Owais Ahmad
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Person Car Bounding Boxes
Description
Custom Training with YOLOv7 🔥

Some Important links

Model Inference🤖

🚀Training Yolov7 on Kaggle

Weight and Biases 🐝

HuggingFace 🤗 Model Repo

Contact Information

Name - Owais Ahmad

Phone - +91-9515884381

Email - owaiskhan9654@gmail.com

Portfolio - https://owaiskhan9654.github.io/

Objective

To showcase custom object detection on the given dataset by training and running inference with the newly launched YOLOv7.

Data Acquisition

The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

Link to the Downloadable Dataset

from IPython.display import Markdown, display
display(Markdown("../input/Car-Person-v2-Roboflow/README.roboflow.txt"))

Custom Training with YOLOv7 🔥

In this notebook, I processed the images with Roboflow because the COCO-formatted dataset contained images of different dimensions and the dataset was not split into separate sets. To train a custom YOLOv7 model, we need it to recognize the objects in the dataset. To do so, I took the following steps:

Export the dataset to YOLOv7

Train YOLOv7 to recognize the objects in our dataset

Evaluate our YOLOv7 model's performance

Run test inference to view performance of YOLOv7 model at work

📦 YOLOv7

Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG

Image Credit - jinfagang

Step 1: Install Requirements

# Download the YOLOv7 repository and install the requirements
!git clone https://github.com/WongKinYiu/yolov7
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow

Downloading YOLOV7 starting checkpoint

!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"

import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display  # to display images

# Report the torch version and the device in use (GPU name, or CPU if no GPU is available)
print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

Image: https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67

I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

YOLOv7-Car-Person-Custom

try:
    # Read the W&B API key from Kaggle Secrets (label: wandb_api)
    user_secrets = UserSecretsClient()
    wandb_api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=wandb_api_key)
    anonymous = None
except Exception:
    # Fall back to an anonymous W&B session if the secret is unavailable
    wandb.login(anonymous='must')
    print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the label name wandb_api. Get your W&B access token from here: https://wandb.ai/authorize')

wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")

Step 2: Assemble Our Dataset

Image: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png

In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
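For bounding boxes, moving from COCO annotations to the YOLO text format mainly means switching from absolute `[x_min, y_min, width, height]` pixel values to a class index followed by a box center and size normalized by the image dimensions. Roboflow performs this conversion in the notebook; the sketch below only illustrates the arithmetic, and the function names and the class-index mapping are hypothetical.

```python
def coco_to_yolo(bbox, img_width, img_height):
    """Convert a COCO box [x_min, y_min, w, h] in pixels to normalized
    YOLO (x_center, y_center, w, h), each in the range 0..1."""
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / img_width
    y_center = (y_min + h / 2) / img_height
    return x_center, y_center, w / img_width, h / img_height


def yolo_label_line(class_id, bbox, img_width, img_height):
    """Format one annotation as a line of a YOLO label file."""
    values = coco_to_yolo(bbox, img_width, img_height)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)


# Hypothetical class order for this two-class task.
CLASS_IDS = {"Person": 0, "Car": 1}

# A 200x100 px car box whose top-left corner is at (100, 50) in a 640x480 image.
line = yolo_label_line(CLASS_IDS["Car"], [100, 50, 200, 100], 640, 480)
print(line)  # 1 0.312500 0.208333 0.312500 0.208333
```

Each image gets one label file with one such line per object, so the class order used here has to match the order declared in the dataset's configuration.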

In Roboflow, we can choose between two paths:

Convert an existing COCO dataset to YOLOv7 format. Roboflow supports over 30 object detection formats for conversion.

Upload raw images and annotate them in Roboflow with Roboflow Annotate.

Version v2 (Aug 12, 2022) looks like this.

Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG

# Read the Roboflow API key from Kaggle Secrets
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")

# Download version 2 of the project in YOLOv7 format
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")

Step 3: Training Custom pretrained YOLOv7 model

Here, I am able to pass a number of arguments: - img: define input image size - batch: determine
git-diff_to_commit_msg_large
kaggle.com
huggingface.co
zip
Updated Oct 19, 2025
+ more versions
Cite
seniru epasinghe (2025). git-diff_to_commit_msg_large [Dataset]. https://www.kaggle.com/datasets/seniruepasinghe/git-diff-to-commit-msg-large
Explore at:
Available download formats: zip (433820173 bytes)
Dataset updated
Oct 19, 2025
Authors
seniru epasinghe
Description
There are 2 versions of this dataset:

git-diff_to_commit_msg - 1.5K rows

huggingface link

kaggle link

git-diff_to_commit_msg_large - 1.75M rows

huggingface link

kaggle link

This is the git-diff_to_commit_msg_large - 1.75M rows version

Custom dataset created from commits in public repositories, with a ChatGPT-4o-generated commit message added for each commit

Consists of 15 programming languages

java, ruby, go, javascript, python, php, html, css, sql, typescript, C#, C++, XML, rust, swift

Can be used to train models, finetune LLMs or for validation purposes
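To make the diff-to-message pairing concrete, here is a minimal sketch that builds a unified diff with Python's standard `difflib` module and packs it together with a commit message into one training record. The record keys `prompt` and `completion`, the prompt wording, and the sample change are illustrative assumptions; the dataset's actual column names are not listed here.

```python
import difflib
import json


def make_diff(old_text, new_text, path):
    """Return a unified diff between two versions of a file as a string."""
    lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


def to_training_record(diff, message):
    """Pair a diff with its commit message as a prompt/completion record."""
    prompt = "Write a commit message for this diff:\n" + diff
    return {"prompt": prompt, "completion": message}


# Made-up one-line bug fix used only to produce a small diff.
old = "def add(a, b):\n    return a - b\n"
new = "def add(a, b):\n    return a + b\n"
diff = make_diff(old, new, "calc.py")
record = to_training_record(diff, "Fix sign error in add()")
print(json.dumps(record, indent=2))
```

Records in this shape can be written one per line as JSON for fine-tuning, or used as held-out pairs when validating a model's generated messages.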
Google Landmarks Dataset v2
github.com
opendatalab.com
Updated Sep 27, 2019
Cite
Google (2019). Google Landmarks Dataset v2 [Dataset]. https://github.com/cvdfoundation/google-landmark
Explore at:
Dataset updated
Sep 27, 2019
Dataset provided by
Google (http://google.com/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
FER-datasets
huggingface.co
Updated Nov 1, 2025
Cite
Minh-Thien Nguyen (2025). FER-datasets [Dataset]. https://huggingface.co/datasets/minhnguyent546/FER-datasets
Explore at:
Dataset updated
Nov 1, 2025
Authors
Minh-Thien Nguyen
Description
FER Datasets

Dataset Link

FER2013 kaggle

CAER-S caer-dataset.github.io

AffectNet kaggle

RAF-DB kaggle
UQABench
huggingface.co
kaggle.com
Updated Oct 6, 2025
Cite
OpenStellarTeam (2025). UQABench [Dataset]. https://huggingface.co/datasets/OpenStellarTeam/UQABench
Explore at:
Dataset updated
Oct 6, 2025
Authors
OpenStellarTeam
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
[KDD'25] UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering [KDD 2025 Accepted (Oral) Paper]

Overview

The paper link: UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering. Github: https://github.com/OpenStellarTeam/UQABench The source data (Kaggle): Kaggle

Description

The UQABench is a benchmark dataset for evaluating user embeddings in prompting LLMs for personalized question answering. The… See the full description on the dataset page: https://huggingface.co/datasets/OpenStellarTeam/UQABench.
Space X to Y Data Analysis & Landing Prediction
kaggle.com
zip
Updated Jan 29, 2023
Cite
Britta Smith (2023). Space X to Y Data Analysis & Landing Prediction [Dataset]. https://www.kaggle.com/brittasmith/spacextoy-dataanalysis-launchprediction
Explore at:
Available download formats: zip (5536699 bytes)
Dataset updated
Jan 29, 2023
Authors
Britta Smith
Description
GitHub Project Link: Space X to Y

Peer Audience Presentation Slides: See PDF uploaded below

Tableau Dashboard Link: Space X to Y

Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10293677%2Fa6d81c06dc03412bfd063941bd1dfa18%2Fspacex-falcon9-reaching-orbit-wide.jpg?generation=1672337964521833&alt=media
Machine Learning users on Github
kaggle.com
zip
Updated Jan 9, 2022
Cite
prosper chuks (2022). Machine Learning users on Github [Dataset]. https://www.kaggle.com/prosperchuks/machine-learning-users-on-github
Explore at:
Available download formats: zip (52282 bytes)
Dataset updated
Jan 9, 2022
Authors
prosper chuks
License
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Description
Data was scraped from Github's API.

Columns

LOGIN: shows the user's GitHub login
ID: user's id
URL: API link to the user's profile
NAME: full name of the user
COMPANY: organization the user is affiliated with
BLOG: link to the user's blog site
LOCATION: location where the user resides
EMAIL: user's email address
BIO: about the user

This dataset contains over 600 users from Lagos, Nigeria and Rwanda

Source: https://github.com/ProsperChuks/Github-Data-Ingestion/tree/main/data
GitHub Social Network
kaggle.com
Updated Jan 12, 2023
Cite
Gitanjali Wadhwa (2023). GitHub Social Network [Dataset]. https://www.kaggle.com/datasets/gitanjali1425/github-social-network-graph-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 12, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Gitanjali Wadhwa
Description
Description

An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on location, repositories starred, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether the GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.

Properties

Directed: No.

Node features: Yes.

Edge features: No.

Node labels: Yes. Binary-labeled.

Temporal: No.

Nodes: 37,700

Edges: 289,003

Density: 0.001

Transitivity: 0.013
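Density and transitivity are standard statistics of an undirected graph: density is the fraction of possible edges that are present, and transitivity is three times the number of triangles divided by the number of connected triples. The sketch below computes both on a small made-up graph using the usual textbook definitions; it does not load the dataset, and the function names are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def transitivity(adj):
    """3 * triangles / connected triples (paths of length two)."""
    closed = 0   # closed triples; each triangle is counted once per corner
    triples = 0  # all connected triples, grouped by their center node
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


# Toy graph: one triangle (0, 1, 2) plus a pendant node 3 attached to node 2.
adj = build_adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
print(round(density(adj), 4), round(transitivity(adj), 4))
```

Low values for both statistics, as reported for this network, are typical of large sparse social graphs where most pairs of users are not connected.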

Possible Tasks

Binary node classification

Link prediction

Community detection

Network visualisation
GitHub Topics Star Count
kaggle.com
zip
Updated Nov 28, 2022
Cite
Jishnu (2022). GitHub Topics Star Count [Dataset]. https://www.kaggle.com/datasets/jishnukoliyadan/github-topics-star-count
Explore at:
Available download formats: zip (421706 bytes)
Dataset updated
Nov 28, 2022
Authors
Jishnu
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
About the data

This is a dataset scraped from https://github.com/topics.

It contains a list of topic titles, user names, repository names, links to the repositories, and star counts.

For each topic, I tried to scrape information on the top 120 GitHub repositories.

The scraped data is made available for the public to use, in appreciation of GitHub users' contributions to open source.

How the data was collected

The data was collected with the help of Selenium and BeautifulSoup libraries in Python.
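One small step in a scrape like this is turning the star counts shown on a page into integers. The sketch below assumes counts may appear abbreviated (for example `43.2k`) or with thousands separators; that display format and the helper name `parse_star_count` are assumptions, since the notebook's own parsing code is not shown here.

```python
from decimal import Decimal

_SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_star_count(text):
    """Convert strings such as '987', '1,204' or '43.2k' to an int."""
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty star count")
    multiplier = 1
    if cleaned[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    # Decimal avoids binary floating-point rounding when scaling values like 43.2.
    return int(Decimal(cleaned) * multiplier)


for raw in ["987", "1,204", "43.2k", " 1.5M "]:
    print(raw.strip(), "->", parse_star_count(raw))
```

Abbreviated counts lose precision, so values parsed this way are only accurate to the displayed digits.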

The web scraping notebook is available here as a reference for future web scraping projects.

Image Credit : Roman Synkevych
Event Data of Popular Open Source Go Repositories
kaggle.com
zip
Updated Jul 13, 2025
Cite
Valentina A (2025). Event Data of Popular Open Source Go Repositories [Dataset]. https://www.kaggle.com/datasets/valentinaask/event-data-of-popular-open-source-go-repositories/data
Explore at:
Available download formats: zip (2824594791 bytes)
Dataset updated
Jul 13, 2025
Authors
Valentina A
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Overview

The dataset contains event data from the 100 most popular GitHub repositories based on the Go programming language, as of 2024. Each commit hash is supplemented with the unique identifier of each action and is linked to all the data provided by GitHub that is connected with that particular commit.

Logical Data Model (image): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7147053%2F69a52e8816b226a0fe6cec90e858d576%2F111.jpg?generation=1752428663053328&alt=media
Github Repos
kaggle.com
zip
Updated Mar 8, 2022
Cite
Dhneshd Dhingra (2022). Github Repos [Dataset]. https://www.kaggle.com/datasets/dhneshddhingra/github-repos/discussion
Explore at:
Available download formats: zip (8586547 bytes)
Dataset updated
Mar 8, 2022
Authors
Dhneshd Dhingra
Description
Description: Over half a million records for GitHub open-source projects

About Dataset: The dataset includes each GitHub repo's link, stars, forks count, issues count, languages used, etc.
Arxiv NLP papers with Github link
kaggle.com
zip
Updated Jan 24, 2019
Cite
Shujian Liu (2019). Arxiv NLP papers with Github link [Dataset]. https://www.kaggle.com/datasets/shujian/arxiv-nlp-papers-with-github-link
Explore at:
Available download formats: zip (45702 bytes)
Dataset updated
Jan 24, 2019
Authors
Shujian Liu
Description
Dataset

This dataset was created by Shujian Liu

Contents
iterative-stratification
kaggle.com
zip
Updated Oct 30, 2022
Cite
Wojciech "Victor" Fulmyk (2022). iterative-stratification [Dataset]. https://www.kaggle.com/datasets/wisawesome/iterativestratification
Explore at:
Available download formats: zip (74738 bytes)
Dataset updated
Oct 30, 2022
Authors
Wojciech "Victor" Fulmyk
Description
iterative-stratification

From: https://github.com/trent-b/iterative-stratification

Downloaded using:
sudo apt-get install git-lfs
git lfs install
git clone https://github.com/trent-b/iterative-stratification

The objective of this upload: - use the included Python library in Kaggle competitions without needing to connect to the internet.

There is no intention to infringe rights of any kind on my part, I simply want to use this library in competitions that require no internet connection. If you are one of the rights holders for this library and you feel your rights are being infringed by this upload, please contact me and I will rectify the issue as soon as possible.

The original source code, of which this is an exact copy, was released under the BSD-3-Clause license. See the GitHub repository for details.
AI Agents Dataset — GitHub Repositories, Use Cases
kaggle.com
zip
Updated Oct 22, 2025
Cite
Alam Shihab (2025). AI Agents Dataset — GitHub Repositories, Use Cases [Dataset]. https://www.kaggle.com/datasets/alamshihab075/ai-agents-dataset-github-repositories-use-cases
Explore at:
Available download formats: zip (4554 bytes)
Dataset updated
Oct 22, 2025
Authors
Alam Shihab
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
AI Agents Dataset — GitHub Repositories, Use Cases

Subtitle

Curated list of AI agent use-cases with direct links to GitHub implementations.

Overview

A compact, curated CSV of AI agent projects with a short title, industry tag, one-line description, and a direct GitHub link. The dataset is intended as a discovery and enrichment seed for engineers, researchers, and educators looking for real-world agent code examples.

Dataset Summary

Statistic Value
Records 71
Columns Use Case, Industry, Description, Code Github
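Given the four columns listed above, a first pass over the CSV could group the use cases by industry. The sketch below does this with the standard `csv` module. The rows in `SAMPLE_CSV` are placeholders invented for illustration rather than actual dataset contents, and the exact header spelling in the real file is assumed to match the column names shown.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows with the dataset's four columns; the values are invented
# for illustration and do not come from the actual CSV.
SAMPLE_CSV = """Use Case,Industry,Description,Code Github
Example agent A,Healthcare,Placeholder description,https://github.com/example/agent-a
Example agent B,Finance,Placeholder description,https://github.com/example/agent-b
Example agent C,Healthcare,Placeholder description,https://github.com/example/agent-c
"""


def group_by_industry(csv_file):
    """Return {industry: [(use case, GitHub link), ...]} from a CSV file object."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["Industry"]].append((row["Use Case"], row["Code Github"]))
    return dict(groups)


groups = group_by_industry(io.StringIO(SAMPLE_CSV))
for industry, items in sorted(groups.items()):
    print(industry, len(items))
```

The same function works on the real file by passing an open file handle instead of the `io.StringIO` wrapper.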

How It Was Collected

Entries were gathered from publicly available GitHub repositories and project examples. Links may point to full repos, notebooks, or subfolders and were captured as-is.

Suggested Uses

This dataset is mostly useful for RAG based systems to automate AI Agent building or coding.

Minimal Notes & Limitations

Link Stability: Links were valid at curation time but can change — verify before automated use.

License Heterogeneity: Each linked repo has its own license; respect those licenses when using code.

Not a Complete Corpus: This is a metadata seed (short descriptions + links). For training or indexing, fetch full README and source files to enrich the dataset.

License

This dataset is provided under CC0 1.0 Universal (public domain). Linked repositories are governed by their own licenses.
CRACK500-20220509T090436Z-001
kaggle.com
zip
Updated May 9, 2022
Cite
Paul David22 (2022). CRACK500-20220509T090436Z-001 [Dataset]. https://www.kaggle.com/datasets/pauldavid22/crack50020220509t090436z001
Explore at:
Available download formats: zip (1643195204 bytes)
Dataset updated
May 9, 2022
Authors
Paul David22
Description
Original Source Link

Below Excerpts taken from the above Github Repo

Pavement crack detection: dataset and model

The project is used to share our recent work on pavement crack detection. For the details of the work, readers are referred to the paper "Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection" (FPHB), T-ITS 2019. You can find the paper at https://www.researchgate.net/publication/330244656_Feature_Pyramid_and_Hierarchical_Boosting_Network_for_Pavement_Crack_Detection or https://arxiv.org/abs/1901.06340.

The pavement crack datasets used in the paper, crack detection results on each dataset, the trained model, and the crack annotation tool are stored in Google Drive, One Drive, and Baidu Yunpan (extract code: jviq).
KaggleMoviesDataSetCleaned
kaggle.com
zip
Updated Feb 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
S Hooper (2025). KaggleMoviesDataSetCleaned [Dataset]. https://www.kaggle.com/susannahooper/kagglemoviesdatasetcleaned
Explore at:
Available download formats: zip (11192292 bytes)
Dataset updated
Feb 21, 2025
Authors
S Hooper
Description
Link to the Kaggle Movies Data Set cleaned with the process specified in the notebook at this GitHub link: https://github.com/shoopy7/shoopy7/blob/main/notebooks/KaggleMoviesDatasetForFeatureExtractionPredictionNtbk.ipynb

Rights to Kaggle's current policy. This is a Kaggle Dataset modified for practice in cleaning.

