100+ datasets found
  1. GitHub Public Repository Metadata

    • kaggle.com
    zip
    Updated Oct 26, 2025
    Cite
    Peter (2025). GitHub Public Repository Metadata [Dataset]. https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars
    Explore at:
    zip (606866859 bytes)
    Dataset updated
    Oct 26, 2025
    Authors
    Peter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was obtained from the GitHub API and contains only public repository-level metadata. It may be useful for anyone interested in studying the GitHub ecosystem. It contains approximately 3.1 million entries.

    The GitHub API Terms of Service apply.

    You may not use this dataset for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.

    Please see the sample exploration notebook for some examples of what you can do! The data format is a JSON array of entries, an example of which is given below.

    Example entry

    {
     "owner": "pelmers",
     "name": "text-rewriter",
     "stars": 13,
     "forks": 5,
     "watchers": 4,
     "isFork": false,
     "isArchived": false,
     "languages": [ { "name": "JavaScript", "size": 21769 }, { "name": "HTML", "size": 2096 }, { "name": "CSS", "size": 2081 } ],
     "languageCount": 3,
     "topics": [ { "name": "chrome-extension", "stars": 43211 } ],
     "topicCount": 1,
     "diskUsageKb": 75,
     "pullRequests": 4,
     "issues": 12,
     "description": "Webextension to rewrite phrases in pages",
     "primaryLanguage": "JavaScript",
     "createdAt": "2015-03-14T22:35:11Z",
     "pushedAt": "2022-02-11T14:26:00Z",
     "defaultBranchCommitCount": 54,
     "license": null,
     "assignableUserCount": 1,
     "codeOfConduct": null,
     "forkingAllowed": true,
     "nameWithOwner": "pelmers/text-rewriter",
     "parent": null
    }
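
As a minimal sketch of working with this format, the snippet below parses one entry shaped like the example above and computes each language's share of the repository's code by bytes. The helper name `language_shares` is hypothetical, and the sample is trimmed to the fields it uses; the real file is a JSON array of full entries.

```python
import json

# One entry, trimmed from the example above; the real file is a JSON array of these.
SAMPLE = """
[
  {
    "nameWithOwner": "pelmers/text-rewriter",
    "stars": 13,
    "languages": [
      {"name": "JavaScript", "size": 21769},
      {"name": "HTML", "size": 2096},
      {"name": "CSS", "size": 2081}
    ]
  }
]
"""


def language_shares(entry):
    """Return {language name: fraction of total bytes} for one entry (hypothetical helper)."""
    total = sum(lang["size"] for lang in entry["languages"])
    if total == 0:
        return {}
    return {lang["name"]: lang["size"] / total for lang in entry["languages"]}


entries = json.loads(SAMPLE)
shares = language_shares(entries[0])
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The full download is several hundred megabytes compressed, so loading it in one `json.load` call may need a lot of memory; a streaming parser could be a better fit there.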
    

    The collection script and exploration notebook are also available on GitHub: https://github.com/pelmers/github-repository-metadata. For more background info, you can read my blog post.

  2. GitHub Dataset

    • kaggle.com
    zip
    Updated Mar 2, 2023
    Cite
    Nikhil Raj (2023). GitHub Dataset [Dataset]. https://www.kaggle.com/nikhil25803/github-dataset
    Explore at:
    zip (79399228 bytes)
    Dataset updated
    Mar 2, 2023
    Authors
    Nikhil Raj
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Two versions of the dataset are available.

    Version 1 Link

    This dataset is a collection of 1052 GitHub repositories, with columns such as the primary language used, fork count, open pull requests, and issue count.

    While working on a repository recommendation project, I curated this data by scraping more than 18,000 repositories and keeping those with at least one open issue, so that users can be recommended a repository they can contribute to.

    Columns

    • repositories - the name of the repository (format: github_username/repository_name)
    • stars_count - star count of the repository
    • forks_count - fork count of the repository
    • issues_count - open issues in the repository
    • pull_requests - open pull requests in the repository
    • contributors - contributors to the project so far
    • language - primary language used in the project

    Version 2 Link

    I found JSON data on Kaggle (link) and wrote a preprocessing function to convert it into a CSV file.

    This is a comparatively bigger dataset, with data on 2,917,951 repositories.

    Columns

    • name - the name of the repository
    • stars_count - star count of the repository
    • forks_count - fork count of the repository
    • watchers - watchers of the repository
    • pull_requests - pull requests made in the repository
    • primary_language - the primary language of the repository
    • languages_used - list of all the languages used in the repository
    • commit_count - commits made in the repository
    • created_at - date and time when the repository was created
    • license - license assigned to the repository

    Note: The data reflects each repository at the time it was scraped, so later updates to the actual repositories are not reflected here.
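
The author's preprocessing function is not shown, so the following is only a sketch of how a JSON-to-CSV conversion into the Version 2 columns might look. It assumes the input entries use the field names from the GitHub Public Repository Metadata example (`stars`, `pullRequests`, `defaultBranchCommitCount`, and so on); that mapping, the `; ` separator for `languages_used`, and the function names are all assumptions.

```python
import csv
import io

# Version 2 column names, as listed above.
COLUMNS = [
    "name", "stars_count", "forks_count", "watchers", "pull_requests",
    "primary_language", "languages_used", "commit_count", "created_at", "license",
]


def to_row(entry):
    """Map one JSON entry to a CSV row (assumed field mapping, not the author's)."""
    languages = entry.get("languages") or []
    return {
        "name": entry.get("name"),
        "stars_count": entry.get("stars"),
        "forks_count": entry.get("forks"),
        "watchers": entry.get("watchers"),
        "pull_requests": entry.get("pullRequests"),
        "primary_language": entry.get("primaryLanguage"),
        "languages_used": "; ".join(lang["name"] for lang in languages),
        "commit_count": entry.get("defaultBranchCommitCount"),
        "created_at": entry.get("createdAt"),
        "license": entry.get("license"),  # None is written as an empty cell
    }


def entries_to_csv(entries):
    """Serialize a list of JSON entries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for entry in entries:
        writer.writerow(to_row(entry))
    return buffer.getvalue()


sample = {
    "name": "text-rewriter", "stars": 13, "forks": 5, "watchers": 4,
    "pullRequests": 4, "primaryLanguage": "JavaScript",
    "languages": [{"name": "JavaScript", "size": 21769}, {"name": "HTML", "size": 2096}],
    "defaultBranchCommitCount": 54, "createdAt": "2015-03-14T22:35:11Z", "license": None,
}
csv_text = entries_to_csv([sample])
print(csv_text)
```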

  3. spinning-up-sac-ant-results

    • kaggle.com
    zip
    Updated Jul 20, 2025
    Cite
    MoniGarrr (2025). spinning-up-sac-ant-results [Dataset]. https://www.kaggle.com/datasets/monigarrr/spinning-up-sac-ant-results
    Explore at:
    zip (2209 bytes)
    Dataset updated
    Jul 20, 2025
    Authors
    MoniGarrr
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SAC Ant Results: OpenAI Spinning Up (Modernized PPO/SAC Baseline)

    Repository: github.com/monigarr/spinningup/tree/monigarr-dev

    This dataset contains structured results from reinforcement learning (RL) experiments using the Soft Actor-Critic (SAC) algorithm on the Ant-v5 environment. It is part of MoniGarr's initiative to modernize and extend the original OpenAI Spinning Up codebase for current Python, PyTorch, and Gymnasium ecosystems.

    The results include detailed logs of reward progression, hyperparameter configurations, evaluation summaries, and visualizations, all generated through reproducible experimentation using an updated and extensible RL workflow.

    DATASET CONTENTS:

    File Name               Description
    sac_ant_results.csv     Epoch-level log of training rewards, timesteps, and key metrics
    sac_config.json         Full configuration used for the SAC training run
    sac_eval_metrics.json   Summary of evaluation metrics including reward and return
    sac_training_plot.png   Reward curve visualization (training performance over time)
    experiment_notes.md     Key observations and tuning notes from the experiments
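
To illustrate how an epoch-level log such as sac_ant_results.csv could be smoothed before plotting, here is a small sketch. The column names `epoch` and `avg_return` and the sample values are made up for illustration; the listing does not document the actual CSV header.

```python
import csv
import io

# Stand-in for sac_ant_results.csv; column names and values are hypothetical.
SAMPLE_CSV = """epoch,avg_return
1,10.0
2,20.0
3,60.0
4,50.0
"""


def moving_average(values, window):
    """Trailing mean over the last `window` values (fewer at the start)."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
returns = [float(row["avg_return"]) for row in rows]
smoothed = moving_average(returns, window=2)
print(smoothed)
```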

    METHODOLOGY

    • Cloned and refactored OpenAI’s Spinning Up repo
    • Replaced deprecated gym with gymnasium
    • Updated SAC implementation for compatibility with PyTorch 2.x
    • Ran long-horizon training on Ant-v5 with multiple seeds and checkpoints
    • Used custom logging for exportable CSV/JSON format results

    INTENDED USES

    This dataset supports:

    • Baseline reproduction and RL benchmarking
    • Curriculum development in deep reinforcement learning
    • Comparative analysis of SAC vs. PPO/TD3
    • Applied research, debugging, and educational tutorials

    WHY THIS DATASET IS USEFUL

    Maintaining parity with evolving RL tools is essential for ensuring reproducibility and learning efficiency. This dataset:

    • Demonstrates SAC performance under modern configurations
    • Offers ready-to-use logs and plots for analysis and reporting
    • Enables faster experimentation for RL students and developers

    PROJECT CONTEXT

    This work is part of MoniGarr's larger suite of open-source AI efforts focused on:

    • Modernizing legacy ML frameworks
    • Promoting accessible, well-documented reinforcement learning pipelines
    • Supporting low-resource developers and researchers with reproducible tools

    GEOSPATIAL COVERAGE

    • Primary Location: Akwesasne NY, Akwesasne Ontario
    • Extended Context: Worldwide (Open-source reproducibility)
    • The dataset was generated in Akwesasne but it's intended for worldwide use in reproducible RL research and education. Since the data is synthetic and code-driven, there's no human subject or location-bound data involved.

    ASSOCIATED PAPERS & SOURCES This dataset builds upon and modernizes results from:

    SPINNING UP IN DEEP RL: OpenAI
    GitHub: https://github.com/openai/spinningup
    Paper: https://spinningup.openai.com/en/latest/spinningup.pdf

    SOFT ACTOR-CRITIC ALGORITHMS: Haarnoja et al., 2018
    Paper: https://arxiv.org/abs/1801.01290
    SAC Code Reference: https://github.com/denisyarats/pytorch_sac

    EXPECTED UPDATE FREQUENCY

    • Initial Release: Complete
    • Updates: Occasional, only if benchmark improvements, environment changes, or additional baseline comparisons (e.g., TD3, PPO-Penalty) are added.
    • Community Contributions: Welcome via GitHub PRs and issues.

    RECOMMENDED COVERAGE

    • Reinforcement Learning education and experimentation
    • Benchmarking reproducible SAC performance on Ant-v5
    • Use in papers, blogs, notebooks, or reproducibility studies
    • Modern RL code comparisons (Gym → Gymnasium, legacy → PyTorch 2.x)

    If you find the dataset helpful, feel free to ⭐️ the repo or connect with @MoniGarr. https://github.com/monigarr/spinningup/tree/monigarr-dev

  4. git-diff_to_commit_msg

    • huggingface.co
    • kaggle.com
    Updated Oct 5, 2025
    + more versions
    Cite
    Epasinghe (2025). git-diff_to_commit_msg [Dataset]. https://huggingface.co/datasets/seniruk/git-diff_to_commit_msg
    Explore at:
    Dataset updated
    Oct 5, 2025
    Authors
    Epasinghe
    License

    https://choosealicense.com/licenses/other/

    Description

    Hi, I’m Seniru Epasinghe 👋

    I’m an AI undergraduate and an AI enthusiast, working on machine learning projects and open-source contributions. I enjoy exploring AI pipelines, natural language processing, and building tools that make development easier.

    There are 2 versions of this dataset:
    

    git-diff_to_commit_msg - 1.5K rows huggingface link kaggle link

    git-diff_to_commit_msg_large - 1.75M rows huggingface link kaggle link… See the full description on the dataset page: https://huggingface.co/datasets/seniruk/git-diff_to_commit_msg.

  5. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/1
    Explore at:
    zip
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training a model and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

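    from IPython.display import Markdown, display
    
    # Render the Roboflow README; filename= makes it explicit that the argument is a path to read
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))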
    

    Custom Training with YOLOv7 🔥

    In this notebook, I processed the images with Roboflow because the COCO-formatted dataset contained images of different dimensions and had not been split into separate subsets. To train a custom YOLOv7 model, we need it to recognize the objects in the dataset. To do so, I took the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading YOLOV7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    # Report the installed torch version and the GPU name (or CPU if no GPU is available)
    print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    

    Image: https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67

    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      # Read the W&B API key from Kaggle Secrets (stored under the label "wandb_api")
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except Exception:
      # No usable secret: fall back to an anonymous W&B session
      wandb.login(anonymous='must')
      print('To use your W&B account,\n'
            'go to Add-ons -> Secrets and provide your W&B access token. Use the label name wandb_api.\n'
            'Get your W&B access token from here: https://wandb.ai/authorize')
    
    wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    Image: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
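
For readers unfamiliar with the target format: YOLO-style label files hold one line per object, `class x_center y_center width height`, with coordinates normalized by the image size, whereas COCO boxes are `[x_min, y_min, width, height]` in pixels. Roboflow's export handles this conversion in the notebook; the sketch below only illustrates the arithmetic. The function names and the class id 0 are hypothetical.

```python
def coco_to_yolo(bbox, img_width, img_height):
    """Convert a COCO box [x_min, y_min, width, height] in pixels to
    YOLO (x_center, y_center, width, height), each normalized to [0, 1]."""
    x_min, y_min, width, height = bbox
    x_center = (x_min + width / 2) / img_width
    y_center = (y_min + height / 2) / img_height
    return x_center, y_center, width / img_width, height / img_height


def yolo_label_line(class_id, bbox, img_width, img_height):
    """Format one object as a line of a YOLO label file."""
    values = coco_to_yolo(bbox, img_width, img_height)
    return f"{class_id} " + " ".join(f"{value:.6f}" for value in values)


# A 200x240 px box with its top-left corner at (100, 120) in a 640x480 image.
line = yolo_label_line(0, [100, 120, 200, 240], 640, 480)
print(line)
```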

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this.

    Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training Custom pretrained YOLOv7 model

    Here, I am able to pass a number of arguments:

    • img: define input image size
    • batch: determine

  6. git-diff_to_commit_msg_large

    • kaggle.com
    • huggingface.co
    zip
    Updated Oct 19, 2025
    + more versions
    Cite
    seniru epasinghe (2025). git-diff_to_commit_msg_large [Dataset]. https://www.kaggle.com/datasets/seniruepasinghe/git-diff-to-commit-msg-large
    Explore at:
    zip (433820173 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    seniru epasinghe
    Description

    There are 2 versions of this dataset:

    This is the git-diff_to_commit_msg_large version, with 1.75M rows.

    It is a custom dataset created from commits in public repositories, with a ChatGPT-4o-generated commit message added for each commit.

    It covers 15 languages:

    java, ruby, go, javascript, python, php, html, css, sql, typescript, C#, C++, XML, rust, swift

    It can be used to train models, fine-tune LLMs, or for validation purposes.

  7. Google Landmarks Dataset v2

    • github.com
    • opendatalab.com
    Updated Sep 27, 2019
    Cite
    Google (2019). Google Landmarks Dataset v2 [Dataset]. https://github.com/cvdfoundation/google-landmark
    Explore at:
    Dataset updated
    Sep 27, 2019
    Dataset provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated to two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.

  8. FER-datasets

    • huggingface.co
    Updated Nov 1, 2025
    Cite
    Minh-Thien Nguyen (2025). FER-datasets [Dataset]. https://huggingface.co/datasets/minhnguyent546/FER-datasets
    Explore at:
    Dataset updated
    Nov 1, 2025
    Authors
    Minh-Thien Nguyen
    Description

    FER Datasets

    Dataset Link

    FER2013 kaggle

    CAER-S caer-dataset.github.io

    AffectNet kaggle

    RAF-DB kaggle

  9. UQABench

    • huggingface.co
    • kaggle.com
    Updated Oct 6, 2025
    Cite
    OpenStellarTeam (2025). UQABench [Dataset]. https://huggingface.co/datasets/OpenStellarTeam/UQABench
    Explore at:
    Dataset updated
    Oct 6, 2025
    Authors
    OpenStellarTeam
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    [KDD'25] UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering [KDD 2025 Accepted (Oral) Paper]

      Overview
    

    Paper: UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering
    GitHub: https://github.com/OpenStellarTeam/UQABench
    Source data (Kaggle): Kaggle

      Description
    

    The UQABench is a benchmark dataset for evaluating user embeddings in prompting LLMs for personalized question answering. The… See the full description on the dataset page: https://huggingface.co/datasets/OpenStellarTeam/UQABench.

  10. Space X to Y Data Analysis & Landing Prediction

    • kaggle.com
    zip
    Updated Jan 29, 2023
    Cite
    Britta Smith (2023). Space X to Y Data Analysis & Landing Prediction [Dataset]. https://www.kaggle.com/brittasmith/spacextoy-dataanalysis-launchprediction
    Explore at:
    zip (5536699 bytes)
    Dataset updated
    Jan 29, 2023
    Authors
    Britta Smith
    Description

    GitHub Project Link: Space X to Y

    Peer Audience Presentation Slides: See PDF uploaded below

    Tableau Dashboard Link: Space X to Y

    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10293677%2Fa6d81c06dc03412bfd063941bd1dfa18%2Fspacex-falcon9-reaching-orbit-wide.jpg?generation=1672337964521833&alt=media

  11. Machine Learning users on Github

    • kaggle.com
    zip
    Updated Jan 9, 2022
    Cite
    prosper chuks (2022). Machine Learning users on Github [Dataset]. https://www.kaggle.com/prosperchuks/machine-learning-users-on-github
    Explore at:
    zip (52282 bytes)
    Dataset updated
    Jan 9, 2022
    Authors
    prosper chuks
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Data was scraped from GitHub's API.

    Columns

    • LOGIN: the user's GitHub login
    • ID: the user's id
    • URL: API link to the user's profile
    • NAME: full name of the user
    • COMPANY: organization the user is affiliated with
    • BLOG: link to the user's blog site
    • LOCATION: location where the user resides
    • EMAIL: the user's email address
    • BIO: about the user

    This dataset contains over 600 users from Lagos, Nigeria and Rwanda.

    Source: https://github.com/ProsperChuks/Github-Data-Ingestion/tree/main/data

  12. GitHub Social Network

    • kaggle.com
    Updated Jan 12, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gitanjali Wadhwa (2023). GitHub Social Network [Dataset]. https://www.kaggle.com/datasets/gitanjali1425/github-social-network-graph-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gitanjali Wadhwa
    Description

    An extensive social network of GitHub developers was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories, and edges are mutual follower relationships between them. The vertex features are extracted based on location, repositories starred, employer, and e-mail address. The task related to the graph is binary node classification: one has to predict whether the GitHub user is a web or a machine learning developer. This target feature was derived from the job title of each user.

    Properties

    • Directed: No.
    • Node features: Yes.
    • Edge features: No.
    • Node labels: Yes. Binary-labeled.
    • Temporal: No.
    • Nodes: 37,700
    • Edges: 289,003
    • Density: 0.001
    • Transitivity: 0.013

    Possible Tasks

    • Binary node classification
    • Link prediction
    • Community detection
    • Network visualisation
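
As a quick sanity check on the properties above, the sketch below computes density and mean degree from the listed node and edge counts, using the usual formula for a simple undirected graph, 2E / (N(N - 1)). Under that formula the density comes out near 0.0004, so the listed 0.001 presumably reflects rounding or a different convention; that reading is an assumption, since the listing does not say how the figure was computed.

```python
def undirected_density(num_nodes, num_edges):
    """Fraction of possible node pairs joined by an edge in a simple undirected graph."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


def mean_degree(num_nodes, num_edges):
    """Average number of neighbours per node; each edge contributes to two nodes."""
    return 2 * num_edges / num_nodes


NODES = 37_700   # from the Properties list above
EDGES = 289_003  # from the Properties list above

density = undirected_density(NODES, EDGES)
avg_degree = mean_degree(NODES, EDGES)
print(f"density = {density:.6f}, mean degree = {avg_degree:.2f}")
```
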
  13. GitHub Topics Star Count

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jishnu (2022). GitHub Topics Star Count [Dataset]. https://www.kaggle.com/datasets/jishnukoliyadan/github-topics-star-count
    Explore at:
    zip (421706 bytes)
    Dataset updated
    Nov 28, 2022
    Authors
    Jishnu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About the data

    • This dataset was scraped from https://github.com/topics.
    • It contains each topic's title, user name, repository name, link to the repository, and star count.
    • For each topic, the scraper attempted to collect information on the top 120 GitHub repositories.
    • The scraped data is made available for public use, with appreciation to the GitHub users for their contributions to open source.

    How was the data collected?

    • The data was collected with the help of the Selenium and BeautifulSoup libraries in Python.
    • The web scraping notebook is available here as a reference for future web scraping projects.

    Image Credit : Roman Synkevych

  14. Event Data of Popular Open Source Go Repositories

    • kaggle.com
    zip
    Updated Jul 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Valentina A (2025). Event Data of Popular Open Source Go Repositories [Dataset]. https://www.kaggle.com/datasets/valentinaask/event-data-of-popular-open-source-go-repositories/data
    Explore at:
    zip (2824594791 bytes)
    Dataset updated
    Jul 13, 2025
    Authors
    Valentina A
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    The dataset contains event data from the 100 most popular GitHub repositories that use Go as their main programming language, as of 2024. Each commit hash is supplemented with the unique identifier of each action, and it is linked to all the data provided by GitHub that is connected with that particular commit.

    Logical Data Model: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7147053%2F69a52e8816b226a0fe6cec90e858d576%2F111.jpg?generation=1752428663053328&alt=media

  15. Github Repos

    • kaggle.com
    zip
    Updated Mar 8, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dhneshd Dhingra (2022). Github Repos [Dataset]. https://www.kaggle.com/datasets/dhneshddhingra/github-repos/discussion
    Explore at:
    zip (8586547 bytes)
    Dataset updated
    Mar 8, 2022
    Authors
    Dhneshd Dhingra
    Description

    Over half a million records for GitHub open-source projects.

    About Dataset

    The dataset includes each GitHub repo's link, star count, fork count, issue count, languages used, etc.

  16. Arxiv NLP papers with Github link

    • kaggle.com
    zip
    Updated Jan 24, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shujian Liu (2019). Arxiv NLP papers with Github link [Dataset]. https://www.kaggle.com/datasets/shujian/arxiv-nlp-papers-with-github-link
    Explore at:
    zip (45702 bytes)
    Dataset updated
    Jan 24, 2019
    Authors
    Shujian Liu
    Description

    Dataset

    This dataset was created by Shujian Liu

    Contents

  17. iterative-stratification

    • kaggle.com
    zip
    Updated Oct 30, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Wojciech "Victor" Fulmyk (2022). iterative-stratification [Dataset]. https://www.kaggle.com/datasets/wisawesome/iterativestratification
    Explore at:
    zip (74738 bytes)
    Dataset updated
    Oct 30, 2022
    Authors
    Wojciech "Victor" Fulmyk
    Description

    iterative-stratification

    From: https://github.com/trent-b/iterative-stratification

    Downloaded using:

    sudo apt-get install git-lfs
    git lfs install
    git clone https://github.com/trent-b/iterative-stratification

    The objective of this upload is to use the included Python library in Kaggle competitions without needing to connect to the internet.

    There is no intention to infringe rights of any kind on my part; I simply want to use this library in competitions that require no internet connection. If you are one of the rights holders for this library and you feel your rights are being infringed by this upload, please contact me and I will rectify the issue as soon as possible.

    The original source code, of which this is an exact copy, was released under the BSD-3-Clause license. See the GitHub repository for details.

  18. AI Agents Dataset — GitHub Repositories, Use Cases

    • kaggle.com
    zip
    Updated Oct 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alam Shihab (2025). AI Agents Dataset — GitHub Repositories, Use Cases [Dataset]. https://www.kaggle.com/datasets/alamshihab075/ai-agents-dataset-github-repositories-use-cases
    Explore at:
    zip (4554 bytes)
    Dataset updated
    Oct 22, 2025
    Authors
    Alam Shihab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AI Agents Dataset — GitHub Repositories, Use Cases

    Subtitle

    Curated list of AI agent use-cases with direct links to GitHub implementations.

    Overview

    A compact, curated CSV of AI agent projects with a short title, industry tag, one-line description, and a direct GitHub link. The dataset is intended as a discovery and enrichment seed for engineers, researchers, and educators looking for real-world agent code examples.

    Dataset Summary

    Statistic   Value
    Records     71
    Columns     Use Case, Industry, Description, Code Github
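
As a small sketch of how a CSV with these four columns might be explored, the snippet below counts entries per industry and collects the GitHub links. The sample rows are placeholders invented for illustration; only the column names come from the summary above.

```python
import csv
import io
from collections import Counter

# Placeholder rows; only the header matches the dataset summary.
SAMPLE_CSV = """Use Case,Industry,Description,Code Github
Example agent A,Healthcare,Placeholder description,https://github.com/example/agent-a
Example agent B,Finance,Placeholder description,https://github.com/example/agent-b
Example agent C,Healthcare,Placeholder description,https://github.com/example/agent-c
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Number of use cases listed under each industry tag.
per_industry = Counter(row["Industry"] for row in rows)

# Keep only entries whose link points at github.com.
github_links = [
    row["Code Github"]
    for row in rows
    if row["Code Github"].startswith("https://github.com/")
]

for industry, count in per_industry.most_common():
    print(f"{industry}: {count}")
```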

    How It Was Collected

    Entries were gathered from publicly available GitHub repositories and project examples. Links may point to full repos, notebooks, or subfolders and were captured as-is.

    Suggested Uses

    This dataset is mostly useful for RAG-based systems that automate AI agent building or coding.

    Minimal Notes & Limitations

    • Link Stability: Links were valid at curation time but can change — verify before automated use.
    • License Heterogeneity: Each linked repo has its own license; respect those licenses when using code.
    • Not a Complete Corpus: This is a metadata seed (short descriptions + links). For training or indexing, fetch full README and source files to enrich the dataset.

    License

    This dataset is provided under CC0 1.0 Universal (public domain). Linked repositories are governed by their own licenses.

  19. CRACK500-20220509T090436Z-001

    • kaggle.com
    zip
    Updated May 9, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Paul David22 (2022). CRACK500-20220509T090436Z-001 [Dataset]. https://www.kaggle.com/datasets/pauldavid22/crack50020220509t090436z001
    Explore at:
    zip (1643195204 bytes)
    Dataset updated
    May 9, 2022
    Authors
    Paul David22
    Description

    Original Source Link

    The excerpts below are taken from the GitHub repo linked above.

    Pavement crack detection: dataset and model

    The project is used to share our recent work on pavement crack detection. For details of the work, readers are referred to the paper "Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection" (FPHB), T-ITS 2019. You can find the paper at https://www.researchgate.net/publication/330244656_Feature_Pyramid_and_Hierarchical_Boosting_Network_for_Pavement_Crack_Detection or https://arxiv.org/abs/1901.06340.

    The pavement crack datasets used in the paper, the crack detection results on each dataset, the trained model, and the crack annotation tool are stored in Google Drive, One Drive, and Baidu Yunpan (extract code: jviq).

  20. KaggleMoviesDataSetCleaned

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    S Hooper (2025). KaggleMoviesDataSetCleaned [Dataset]. https://www.kaggle.com/susannahooper/kagglemoviesdatasetcleaned
    Explore at:
    zip (11192292 bytes)
    Dataset updated
    Feb 21, 2025
    Authors
    S Hooper
    Description

    Link to the Kaggle Movies Data Set, cleaned with the process specified in the notebook at this GitHub link: https://github.com/shoopy7/shoopy7/blob/main/notebooks/KaggleMoviesDatasetForFeatureExtractionPredictionNtbk.ipynb

    Rights are subject to Kaggle's current policy. This is a Kaggle dataset modified for practice in cleaning.


GitHub Public Repository Metadata

Metadata (i.e. no code) of all public repositories with 5+ stars on GitHub

Explore at:
4 scholarly articles cite this dataset (View in Google Scholar)
