GitHub Issues & Kaggle Notebooks
Description
GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language models training, they are sourced from GitHub issues and notebooks in Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, precisely the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.
Schmitz005/kaggle-recipe-categorized-chunk-8 dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
## Overview
Damaged Roads Alvaro Basily Kaggle is a dataset for object detection tasks - it contains Damaged Roads annotations for 3,321 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [Public Domain license](https://creativecommons.org/licenses/Public Domain).
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebooks versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions
in Meta Kaggle. The file names match the ids in the KernelVersions
csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads
. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Timerkhanov Yuriy
Released under CC0: Public Domain
Some images were collected from Kaggle: https://www.kaggle.com/datasets/dataclusterlabs/suitcaseluggage-dataset
More images were also collected from the following Roboflow Universe projects: * https://universe.roboflow.com/ali-ahmad-kyfzj/baggage-rvbtb * https://universe.roboflow.com/luggage-7rqr6/luggage-kcuiy *****
Animation for Kaggle showing a plume moving across an array of methane laser measurement paths
Animation for Kaggle showing laser path measurements of methane over a plume of methane gas. Reflectors arranged in a fan configuration.
Synthetic transactional data with labels for fraud detection. For more information, see: https://www.kaggle.com/ntnu-testimon/paysim1/version/2
DataSF seeks to transform the way that the City of San Francisco works -- through the use of data.
This dataset contains the following tables: ['311_service_requests', 'bikeshare_stations', 'bikeshare_status', 'bikeshare_trips', 'film_locations', 'sffd_service_calls', 'sfpd_incidents', 'street_trees']
This dataset is deprecated and not being updated.
Fork this kernel to get started with this dataset.
Dataset Source: SF OpenData. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://sfgov.org/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @meric from Unplash.
Which neighborhoods have the highest proportion of offensive graffiti?
Which complaint is most likely to be made using Twitter and in which neighborhood?
What are the most complained about Muni stops in San Francisco?
What are the top 10 incident types that the San Francisco Fire Department responds to?
How many medical incidents and structure fires are there in each neighborhood?
What’s the average response time for each type of dispatched vehicle?
Which category of police incidents have historically been the most common in San Francisco?
What were the most common police incidents in the category of LARCENY/THEFT in 2016?
Which non-criminal incidents saw the biggest reporting change from 2015 to 2016?
What is the average tree diameter?
What is the highest number of a particular species of tree planted in a single year?
Which San Francisco locations feature the largest number of trees?
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
The DocOrNot dataset contains 50% of images that are pictures, and 50% that are documents. It was built using 8k images from each one of these sources:
RVL CDIP (Small) - https://www.kaggle.com/datasets/uditamin/rvl-cdip-small - license: https://www.industrydocuments.ucsf.edu/help/copyright/ Flickr8k - https://www.kaggle.com/datasets/adityajn105/flickr8k - license: https://creativecommons.org/publicdomain/zero/1.0/
It can be used to train a model and classify an image as being a picture or a… See the full description on the dataset page: https://huggingface.co/datasets/Mozilla/docornot.
RSNA assembled this dataset in 2019 for the RSNA Intracranial Hemorrhage Detection AI Challenge (https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/). De-identified head CT studies were provided by four research institutions. A group of over 60 volunteer expert radiologists recruited by RSNA and the American Society of Neuroradiology labeled over 25,000 exams for the presence and subtype classification of acute intracranial hemorrhage.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F23345571%2F4471e4ade50676d782d4787f77aa08ad%2F1000_F_256252609_6WIHRGbpzSaVQwioubxwgXdSJTNONNcK.jpg?generation=1739209341333909&alt=media" alt="">
This dataset contains 2,700 images focused on detecting potholes, cracks, and open manholes on roads. It has been augmented to enhance the variety and robustness of the data. The images are organized into training and validation sets, with three distinct categories:
Included in the Dataset: - Bounding Box Annotations in YOLO Format (.txt files) - Format: YOLOv8 & YOLO11 compatible - Purpose: Ready for training YOLO-based object detection models
Folder Structure Organized into:
Dual Format Support
Use Cases Targeted
Here's a clear breakdown of the folder structure:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F23345571%2F023b40c98bf858c58394d6ed2393bfc3%2FScreenshot%202025-05-01%20202438.png?generation=1746109541780835&alt=media" alt="">
Compressed
folder)All of the images are from 100K Dashcam videos. It is collected from the BDD100K dataset.
The images are separated from the videos within 5-second intervals as individual frames.
This dataset contains 10,000 images.
The annotation has been provided in the xlsx
file as well.
The dataset contains 2 classes. They are given below:
Class Representation | Class |
---|---|
y | Collision/Accident |
n | No Collision/No Accident |
https://datacatalog1.worldbank.org/public-licenses?fragment=cchttps://datacatalog1.worldbank.org/public-licenses?fragment=cc
Health Nutrition and Population Statistics database provides key health, nutrition and population statistics gathered from a variety of international and national sources. Themes include global surgery, health financing, HIV/AIDS, immunization, infectious diseases, medical resources and usage, noncommunicable diseases, nutrition, population dynamics, reproductive health, universal health coverage, and water and sanitation.
The data contains client loan data and whether there loan got approved or not. The main goal is to find out loan approval prediction over testing data using model (created using training data)
This dataset contains Food Prices data for Kenya. Food prices data comes from the World Food Programme and covers foods such as maize, rice, beans, fish, and sugar for 76 countries and some 1,500 markets. It is updated weekly but contains to a large extent monthly data. The data goes back as far as 1992 for a few countries, although many countries started reporting from 2003 or thereafter.
This dataset contains 3,400 records of fashion retail sales, capturing various details about customer purchases, including item details, purchase amounts, ratings, and payment methods. It is useful for analyzing customer buying behavior, product popularity, and payment preferences.
Column Name | Data Type | Non-Null Count | Description |
---|---|---|---|
Customer Reference ID | Integer | 3,400 | A unique identifier for each customer. |
Item Purchased | String | 3,400 | The name of the fashion item purchased. |
Purchase Amount (USD) | Float | 2,750 | The purchase price of the item in USD (650 missing values). |
Date Purchase | String | 3,400 | The date on which the purchase was made (format: DD-MM-YYYY). |
Review Rating | Float | 3,076 | The customer review rating (scale: 1 to 5, 324 missing values). |
Payment Method | String | 3,400 | The payment method used (e.g., Credit Card, Cash). |
Purchase Amount (USD)
: 650 missing values Review Rating
: 324 missing values Payment Method
includes multiple categories, allowing analysis of payment trends. Date Purchase
is in DD-MM-YYYY format, which can be useful for time-series analysis. This dataset contains Food Prices data for Pakistan. Food prices data comes from the World Food Programme and covers foods such as maize, rice, beans, fish, and sugar for 76 countries and some 1,500 markets. It is updated weekly but contains to a large extent monthly data. The data goes back as far as 1992 for a few countries, although many countries started reporting from 2003 or thereafter.
GitHub Issues & Kaggle Notebooks
Description
GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language models training, they are sourced from GitHub issues and notebooks in Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, precisely the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.