Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is obtained from the GitHub API and contains only public repository-level metadata. It may be useful for anyone interested in studying the GitHub ecosystem. It contains approximately 3.1 million entries.
The GitHub API Terms of Service apply.
You may not use this dataset for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.
Please see the sample exploration notebook for some examples of what you can do! The data format is a JSON array of entries, an example of which is given below.
{
"owner": "pelmers",
"name": "text-rewriter",
"stars": 13,
"forks": 5,
"watchers": 4,
"isFork": false,
"isArchived": false,
"languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
"languageCount": 3,
"topics": [ { "name": "chrome-extension", "stars": 43211 } ],
"topicCount": 1,
"diskUsageKb": 75,
"pullRequests": 4,
"issues": 12,
"description": "Webextension to rewrite phrases in pages",
"primaryLanguage": "JavaScript",
"createdAt": "2015-03-14T22:35:11Z",
"pushedAt": "2022-02-11T14:26:00Z",
"defaultBranchCommitCount": 54,
"license": null,
"assignableUserCount": 1,
"codeOfConduct": null,
"forkingAllowed": true,
"nameWithOwner": "pelmers/text-rewriter",
"parent": null
}
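For example, a minimal sketch of loading and querying the array with Python's standard library (the file name here is an assumption, not part of the dataset spec):
import json
from collections import Counter

# Load the JSON array of repository entries (file name assumed).
with open("repo_metadata.json") as f:
    repos = json.load(f)

# Example: count repositories by primary language, skipping null values.
counts = Counter(r["primaryLanguage"] for r in repos if r["primaryLanguage"])
print(counts.most_common(10))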
The collection script and exploration notebook are also available on GitHub: https://github.com/pelmers/github-repository-metadata. For more background info, you can read my blog post.
GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset is a collection of 1052 GitHub repositories, along with columns such as the primary language used, fork count, open pull requests, and issue count.
While working on a repository recommendation project, I curated this data by scraping around 18,000 repositories and filtering for those with at least one open issue, so that a user can be recommended a repository to contribute to.
Columns
repositories - the name of the repository (Format - github_username/repository_name)
stars_count - stars count of the repository
forks_count - fork count of the repository
issues_count - active/opened issues in the repository
pull_requests - pull requests opened in the repository
contributors - number of contributors to the project so far
language - primary language used in the project
I found JSON data on Kaggle (link) and wrote a preprocessing function to convert it into a CSV file.
This is a comparatively bigger dataset, with data on 2,917,951 repositories.
Columns
name - the name of the repository
stars_count - stars count of the repository
forks_count - forks count of the repository
watchers - watchers in the repository
pull_requests - pull requests made in the repository
primary_language - the primary language of the repository
languages_used - list of all the languages used in the repository
commit_count - commits made in the repository
created_at - time and date when the repository was created
license - license assigned to the repository.
Note: The data in the dataset is from the time when it was scraped, so any updates in the actual repositories will not be reflected here.
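As a quick-start sketch, the CSV can be loaded with pandas; the file name is an assumption, while the column names follow the list above:
import pandas as pd

# Load the repository table (file name assumed) and peek at the most-starred rows.
df = pd.read_csv("github_dataset.csv")
top = df.sort_values("stars_count", ascending=False)
print(top[["name", "stars_count", "forks_count", "primary_language"]].head())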
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SAC Ant Results: OpenAI Spinning Up (Modernized PPO/SAC Baseline) Repository: github.com/monigarr/spinningup/tree/monigarr-dev
This dataset contains structured results from reinforcement learning (RL) experiments using the Soft Actor-Critic (SAC) algorithm on the Ant-v5 environment. It is part of MoniGarr's initiative to modernize and extend the original OpenAI Spinning Up codebase for current Python, PyTorch, and Gymnasium ecosystems.
The results include detailed logs of reward progression, hyperparameter configurations, evaluation summaries, and visualizations, all generated through reproducible experimentation using an updated and extensible RL workflow.
DATASET CONTENTS:
| File Name | Description |
|---|---|
| sac_ant_results.csv | Epoch-level log of training rewards, timesteps, and key metrics |
| sac_config.json | Full configuration used for the SAC training run |
| sac_eval_metrics.json | Summary of evaluation metrics including reward and return |
| sac_training_plot.png | Reward curve visualization (training performance over time) |
| experiment_notes.md | Key observations and tuning notes from the experiments |
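As a hedged sketch, the epoch-level CSV can be plotted like this; the column names ("Epoch", "AverageEpRet") are assumptions based on Spinning Up's logger conventions, so check the CSV header first:
import pandas as pd
import matplotlib.pyplot as plt

# Plot the training reward curve from the epoch-level log (column names assumed).
results = pd.read_csv("sac_ant_results.csv")
plt.plot(results["Epoch"], results["AverageEpRet"])
plt.xlabel("Epoch")
plt.ylabel("Average episode return")
plt.title("SAC on Ant-v5")
plt.show()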
METHODOLOGY
- Cloned and refactored OpenAI's Spinning Up repo
- Replaced deprecated gym with gymnasium
- Updated SAC implementation for compatibility with PyTorch 2.x
- Ran long-horizon training on Ant-v5 with multiple seeds and checkpoints
- Used custom logging for exportable CSV/JSON format results
INTENDED USES This dataset supports:
- Baseline reproduction and RL benchmarking
- Curriculum development in deep reinforcement learning
- Comparative analysis of SAC vs. PPO/TD3
- Applied research, debugging, and educational tutorials
WHY THIS DATASET IS USEFUL Maintaining parity with evolving RL tools is essential for ensuring reproducibility and learning efficiency. This dataset:
- Demonstrates SAC performance under modern configurations
- Offers ready-to-use logs and plots for analysis and reporting
- Enables faster experimentation for RL students and developers
PROJECT CONTEXT This work is part of MoniGarr's larger suite of open-source AI efforts focused on:
- Modernizing legacy ML frameworks
- Promoting accessible, well-documented reinforcement learning pipelines
- Supporting low-resource developers and researchers with reproducible tools
GEOSPATIAL COVERAGE
- Primary Location: Akwesasne, NY and Akwesasne, Ontario
- Extended Context: Worldwide (open-source reproducibility)
The dataset was generated in Akwesasne but is intended for worldwide use in reproducible RL research and education. Since the data is synthetic and code-driven, no human-subject or location-bound data is involved.
ASSOCIATED PAPERS & SOURCES This dataset builds upon and modernizes results from:
SPINNING UP IN DEEP RL (OpenAI)
GitHub: https://github.com/openai/spinningup
Paper: https://spinningup.openai.com/en/latest/spinningup.pdf
SOFT ACTOR-CRITIC ALGORITHMS (Haarnoja et al., 2018)
Paper: https://arxiv.org/abs/1801.01290
SAC Code Reference: https://github.com/denisyarats/pytorch_sac
EXPECTED UPDATE FREQUENCY
Initial Release: Complete
Updates: Occasional; only if benchmark improvements, environment changes, or additional baseline comparisons (e.g., TD3, PPO-Penalty) are added.
Community Contributions: Welcome via GitHub PRs and issues.
RECOMMENDED COVERAGE
- Reinforcement Learning education and experimentation
- Benchmarking reproducible SAC performance on Ant-v5
- Use in papers, blogs, notebooks, or reproducibility studies
- Modern RL code comparisons (Gym → Gymnasium, legacy → PyTorch 2.x)
If you find the dataset helpful, feel free to ⭐️ the repo or connect with @MoniGarr. https://github.com/monigarr/spinningup/tree/monigarr-dev
License: Other (https://choosealicense.com/licenses/other/)
Hi, I’m Seniru Epasinghe 👋
I'm an AI undergraduate and enthusiast, working on machine learning projects and open-source contributions. I enjoy exploring AI pipelines, natural language processing, and building tools that make development easier.
There are 2 versions of this dataset:
git-diff_to_commit_msg - 1.5K rows huggingface link kaggle link
git-diff_to_commit_msg_large - 1.75M rows huggingface link kaggle link… See the full description on the dataset page: https://huggingface.co/datasets/seniruk/git-diff_to_commit_msg.
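A minimal sketch for loading either version with the Hugging Face datasets library (the split name and record fields are assumptions; check the dataset card):
from datasets import load_dataset

# Load the smaller version from the Hub (split name assumed).
ds = load_dataset("seniruk/git-diff_to_commit_msg", split="train")
print(ds[0])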
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.
from IPython.display import Markdown, display

# Render the Roboflow README; pass the path via filename= so the file contents
# are displayed rather than the path string itself.
display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
In this notebook, I have processed the images with Roboflow, because the COCO-formatted dataset had images of different dimensions and was not split into the required sets. To train a custom YOLOv7 model, we need to recognize the objects in the dataset. To do so, I have taken the following steps:
Image Credit - jinfagang
!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display # to display images
print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!
try:
    user_secrets = UserSecretsClient()
    wandb_api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=wandb_api_key)
    anonymous = None
except Exception:
    # Fall back to anonymous logging when no secret is configured.
    wandb.login(anonymous='must')
    print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. '
          'Use the label name WANDB. Get your W&B access token from here: https://wandb.ai/authorize')
wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
[Image: computer vision cycle]
In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
In Roboflow, we can choose between two paths:
[Image: Roboflow]
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")
Here, I am able to pass a number of arguments:
- img: define input image size
- batch: determine batch size
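As a hedged sketch of what the full training invocation might look like (the flag names follow YOLOv7's train.py but should be checked with python train.py --help; the values are illustrative, and the dataset path comes from the Roboflow download above):
# Train on the downloaded Roboflow dataset (flags and values are illustrative).
!python train.py --batch-size 16 --img-size 640 --epochs 50 --data {dataset.location}/data.yaml --weights yolov7.pt --name yolov7-car-person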
Languages: java, ruby, go, javascript, python, php, html, css, sql, typescript, C#, C++, XML, rust, swift
Can be used to train models, fine-tune LLMs, or for validation purposes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
FER Datasets
| Dataset | Link |
|---|---|
| FER2013 | Kaggle |
| CAER-S | caer-dataset.github.io |
| AffectNet | Kaggle |
| RAF-DB | Kaggle |
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
[KDD'25] UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering [KDD 2025 Accepted (Oral) Paper]
Overview
Paper: UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering
GitHub: https://github.com/OpenStellarTeam/UQABench
Source data: Kaggle
Description
The UQABench is a benchmark dataset for evaluating user embeddings in prompting LLMs for personalized question answering. The… See the full description on the dataset page: https://huggingface.co/datasets/OpenStellarTeam/UQABench.
[Image: SpaceX Falcon 9 reaching orbit]
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Data was scraped from GitHub's API.
LOGIN: the user's GitHub login
ID: user's id
URL: API link to the user's profile
NAME: full name of the user
COMPANY: organization the user is affiliated with
BLOG: link to the user's blog site
LOCATION: location where the user resides
EMAIL: user's email address
BIO: about the user
This dataset contains over 600 users from Lagos, Nigeria and Rwanda
Source: https://github.com/ProsperChuks/Github-Data-Ingestion/tree/main/data
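A small sketch of exploring the table with pandas (the file name is an assumption; the columns follow the list above):
import pandas as pd

# Count users per location (file name assumed).
users = pd.read_csv("github_users.csv")
print(users["LOCATION"].value_counts())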
Description
An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on location, starred repositories, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether the GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.
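A minimal sketch of loading the graph with pandas and networkx; the file and column names are assumptions about the distributed edge list, so adjust them to the actual files:
import pandas as pd
import networkx as nx

# Build the mutual-follower graph from the edge list (file/column names assumed).
edges = pd.read_csv("musae_git_edges.csv")
G = nx.from_pandas_edgelist(edges, source="id_1", target="id_2")
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")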
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
Image Credit: Roman Synkevych
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset contains event data from the 100 most popular GitHub repositories that use the Go programming language, as of 2024. Each commit hash is supplemented with the unique identifier of each action and is linked to all the data GitHub provides for that commit.
Logical Data Model
[Image: Logical Model]
Description: Over half a million records for GitHub open-source projects.
About Dataset: Includes GitHub repo link, its stars, forks count, issues count, languages used, etc.
This dataset was created by Shujian Liu
iterative-stratification
From: https://github.com/trent-b/iterative-stratification
Downloaded using:
sudo apt-get install git-lfs
git lfs install
git clone https://github.com/trent-b/iterative-stratification
The objective of this upload: use the included Python library in Kaggle competitions without needing to connect to the internet.
There is no intention to infringe rights of any kind on my part; I simply want to use this library in competitions that require no internet connection. If you are one of the rights holders for this library and you feel your rights are being infringed by this upload, please contact me and I will rectify the issue as soon as possible.
The original source code, of which this is an exact copy, was released under the BSD-3-Clause license. See the GitHub repository for details.
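For reference, the library's documented entry point is MultilabelStratifiedKFold; a self-contained example on synthetic multilabel data:
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

# Synthetic multilabel data: 100 samples, 5 features, 3 binary labels.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=(100, 3))

# Split while keeping each label's positive rate roughly balanced across folds.
mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(mskf.split(X, y)):
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")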
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Curated list of AI agent use-cases with direct links to GitHub implementations.
A compact, curated CSV of AI agent projects with a short title, industry tag, one-line description, and a direct GitHub link. The dataset is intended as a discovery and enrichment seed for engineers, researchers, and educators looking for real-world agent code examples.
| Statistic | Value |
|---|---|
| Records | 71 |
| Columns | Use Case, Industry, Description, Code Github |
Entries were gathered from publicly available GitHub repositories and project examples. Links may point to full repos, notebooks, or subfolders and were captured as-is.
This dataset is mostly useful for RAG-based systems that automate AI agent building or coding.
This dataset is provided under CC0 1.0 Universal (public domain). Linked repositories are governed by their own licenses.
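A short sketch of filtering the CSV by industry with pandas; the file name and the "Finance" tag value are assumptions, while the column names are documented above:
import pandas as pd

# Load the curated list and filter by an industry tag (file name and tag value assumed).
agents = pd.read_csv("ai_agent_use_cases.csv")
finance = agents[agents["Industry"] == "Finance"]
print(finance[["Use Case", "Code Github"]])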
The project is used to share our recent work on pavement crack detection. For details of the work, readers are referred to the paper "Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection" (FPHB), T-ITS 2019. You can find the paper at https://www.researchgate.net/publication/330244656_Feature_Pyramid_and_Hierarchical_Boosting_Network_for_Pavement_Crack_Detection or https://arxiv.org/abs/1901.06340.
The pavement crack datasets used in the paper, crack detection results on each dataset, the trained model, and the crack annotation tool are stored in Google Drive, OneDrive, and Baidu Yunpan (extract code: jviq).
Link to the Kaggle Movies Dataset, cleaned with the process specified in the notebook at this GitHub link: https://github.com/shoopy7/shoopy7/blob/main/notebooks/KaggleMoviesDatasetForFeatureExtractionPredictionNtbk.ipynb
Rights according to Kaggle's current policy. This is a Kaggle dataset modified for practice in cleaning.