https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Budi Ryan
Released under CC0: Public Domain
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A dataset for the first task, "Explain or teach basic data science concepts," of the Google – AI Assistants for Data Tasks with Gemma competition.
This dataset contains several data science glossaries, where every sample contains two keys: term (the vocabulary name) and definition.
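For illustration, a record under this two-key structure might look like the following sketch (the term and definition shown are hypothetical, not taken from the dataset):

```python
# Hypothetical sample record following the term/definition schema described above.
sample = {
    "term": "overfitting",
    "definition": "When a model fits the training data too closely and generalizes poorly to new data.",
}
print(sample["term"], "-", sample["definition"])
```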
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for the LLM Science Exam Kaggle Competition
Dataset Summary
https://www.kaggle.com/competitions/kaggle-llm-science-exam/data
Languages
[en, de, tl, it, es, fr, pt, id, pl, ro, so, ca, da, sw, hu, no, nl, et, af, hr, lv, sl]
Dataset Structure
Columns:
prompt - the text of the question being asked
A - option A; if this option is correct, then answer will be A
B - option B; if this option is correct, then answer will be B
C - option C; if this… See the full description on the dataset page: https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam.
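A minimal loading sketch, assuming the Hugging Face datasets library and that the data is exposed under a train split (the split name is an assumption):

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub.
ds = load_dataset("Sangeetha/Kaggle-LLM-Science-Exam")

# Inspect the first question and a few of its options ("train" split assumed).
row = ds["train"][0]
print(row["prompt"], row["A"], row["B"], row["C"])
```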
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains the raw source code from hundreds of thousands of versions of public, Apache 2.0 licensed Python and R notebooks on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
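As a sketch of the layout just described, the two directory components can be derived from a KernelVersions id with integer division (illustrative only; file extensions depend on the notebook language):

```python
# Map a KernelVersions id to its two-level folder, per the layout above.
def kernel_version_dir(version_id: int) -> str:
    top = version_id // 1_000_000        # millions group, e.g. 123456789 -> 123
    sub = (version_id // 1_000) % 1_000  # thousands group, e.g. 123456789 -> 456
    return f"{top}/{sub}"

assert kernel_version_dir(123_456_789) == "123/456"
```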
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket, meaning you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
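A minimal download sketch, assuming the google-cloud-storage Python client, a billing-enabled project, and a hypothetical object path (actual paths follow the directory layout above):

```python
from google.cloud import storage

# user_project identifies the project billed for this requester-pays download.
client = storage.Client(project="your-gcp-project")  # assumed project id
bucket = client.bucket("kaggle-meta-kaggle-code-downloads",
                       user_project="your-gcp-project")
blob = bucket.blob("123/456/123456789.ipynb")        # hypothetical object path
blob.download_to_filename("123456789.ipynb")
```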
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
This dataset contains crop types of agricultural fields in four states of northern India: Uttar Pradesh, Rajasthan, Odisha, and Bihar. There are 13 classes in the dataset: Fallow land and 12 crop types (Wheat, Mustard, Lentil, Green pea, Sugarcane, Garlic, Maize, Gram, Coriander, Potato, Bersem, and Rice). The dataset is split into train and test collections as part of the AgriFieldNet India Competition. Ground reference data for this dataset was collected by IDinsight's Data on Demand team. Radiant Earth Foundation carried out the training dataset curation and publication. This training dataset was generated through a grant from the Enabling Crop Analytics at Scale (ECAAS) Initiative, funded by The Bill & Melinda Gates Foundation and implemented by Tetra Tech.
https://creativecommons.org/publicdomain/zero/1.0/
Original data from Predict Future Sales (Kaggle competition). Translated item_categories.csv, shops.csv, and items.csv from Russian to English for easier feature engineering and reference.
Translated item descriptions and shop names from Russian to English. items.csv - supplemental information about the items/products. item_categories.csv - supplemental information about the item categories. shops.csv - supplemental information about the shops.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compressed (.zip) archive containing the data set in .csv format and a README.txt file explaining the columns. (ZIP)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain:
The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.
Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.
How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.
Dataset Structure:
The dataset consists of three main files, each with its specific role:
Train:
This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).
https://handle.test.datacite.org/10.82556/yb6j-jw41
PID: b1c59499-9c6e-42c2-af8f-840181e809db
Test2:
The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models and for evaluating prediction quality when the true sales data is unknown.
https://handle.test.datacite.org/10.82556/jerg-4b84
PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
Store:
This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.
https://handle.test.datacite.org/10.82556/nqeg-gy34
PID: 9627ec46-4ee6-4969-b14a-bda555fe34db
Column Descriptions:
Id: A unique identifier for each (Store, Date) combination within the test set.
Store: A unique identifier for each store.
Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).
Customers: The number of customers visiting the store on a given day.
Open: An indicator of whether the store was open (1 = open, 0 = closed).
StateHoliday: Indicates if the day is a state holiday, with values like:
'a' = public holiday,
'b' = Easter holiday,
'c' = Christmas,
'0' = no holiday.
SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).
StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.
Assortment: Describes the level of product assortment in the store:
'a' = basic,
'b' = extra,
'c' = extended.
CompetitionDistance: Distance (in meters) to the nearest competitor store.
CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.
Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).
Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).
Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.
PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
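A minimal loading sketch based on the columns above, assuming the files are named train.csv and store.csv and include a Date column (as implied by the Id description):

```python
import pandas as pd

# StateHoliday mixes '0' and letters, so read it as a string throughout.
train = pd.read_csv("train.csv", parse_dates=["Date"], dtype={"StateHoliday": str})
store = pd.read_csv("store.csv")

# Attach per-store metadata (StoreType, Assortment, competition fields) to each day.
df = train.merge(store, on="Store", how="left")

# Keep only days the store was open, and flag state holidays.
df = df[df["Open"] == 1].copy()
df["IsStateHoliday"] = (df["StateHoliday"] != "0").astype(int)
```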
To work with this dataset, you will need to have specific software installed, including:
DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.
Python Libraries: Key libraries for working with the dataset include pandas for data manipulation, numpy for numerical operations, matplotlib and seaborn for data visualization, and scikit-learn for machine learning algorithms.
Several additional resources are available for working with the dataset:
Presentation:
A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.
Jupyter Notebook:
A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.
Model Evaluation Results:
The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.
Trained Models (.pkl files):
The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression) that can be loaded and used to make predictions without retraining the models from scratch (see the sketch after this resource list).
sample_submission.csv:
This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.
These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
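A hypothetical end-to-end sketch tying these resources together; the file name random_forest_model.pkl and the feature-preparation step are assumptions, since the exact naming and preprocessing live in the notebook:

```python
import joblib
import pandas as pd

test = pd.read_csv("test.csv", parse_dates=["Date"], dtype={"StateHoliday": str})
model = joblib.load("random_forest_model.pkl")  # assumed file name

# Assumption: features must be prepared exactly as in the training notebook;
# here we naively drop the identifier and date columns for illustration.
X_test = test.drop(columns=["Id", "Date"])

submission = pd.DataFrame({"Id": test["Id"], "Sales": model.predict(X_test)})
submission.to_csv("sample_submission.csv", index=False)
```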
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source: [9].
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A cleaned version of Competitions.csv focused on timeline analysis.
✅ Includes: CompetitionId, Title, Deadline, EnabledDate, HostSegmentTitle
✅ Helps understand growth over time and regional hosting focus
✅ Can be joined with teams_clean.csv and user_achievements_clean.csv
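A minimal join sketch, assuming teams_clean.csv also carries a CompetitionId key (the cleaned file name and the join key are assumptions):

```python
import pandas as pd

competitions = pd.read_csv("competitions_clean.csv",  # assumed file name
                           parse_dates=["Deadline", "EnabledDate"])
teams = pd.read_csv("teams_clean.csv")

# Join team records onto their competitions.
merged = competitions.merge(teams, on="CompetitionId", how="left")
print(merged.shape)

# Growth over time by host segment, from the competition-level table.
competitions["Year"] = competitions["EnabledDate"].dt.year
print(competitions.groupby(["Year", "HostSegmentTitle"]).size().head())
```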
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This multi-city human mobility dataset contains data from 4 metropolitan areas (cities A, B, C, D) somewhere in Japan. Each city is divided into 500 meter x 500 meter cells spanning a 200 x 200 grid. The datasets capture the movement of individuals across a 75-day period, discretized into 30-minute intervals and 500-meter grid cells. The four cities contain the movement data of 100,000, 25,000, 20,000, and 6,000 individuals, respectively.
While the name or location of the city is not disclosed, the participants are provided with points-of-interest (POIs; e.g., restaurants, parks) data for each grid cell (~85 dimensional vector) for the four cities as supplementary information (e.g., POIdata_cityA). The list of 85 POI categories can be found in POI_datacategories.csv.
This dataset was used for the HuMob Data Challenge 2024 competition. For more details, see https://wp.nyu.edu/humobchallenge2024/
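A small sketch of the discretization described above; the mapping from raw coordinates to cells is an assumption for illustration, since the released data is already discretized:

```python
# 500 m x 500 m cells on a 200 x 200 grid; 30-minute slots (48 per day, 75 days).
def to_cell(x_meters: float, y_meters: float) -> tuple[int, int]:
    return int(x_meters // 500), int(y_meters // 500)  # indices in [0, 200)

def to_timeslot(day: int, minutes_into_day: int) -> int:
    return day * 48 + minutes_into_day // 30           # slots in [0, 75 * 48)
```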
Researchers may use this dataset for publications and reports, as long as: 1) Users shall not carry out activities that involve unethical usage of the data, including attempts at re-identifying data subjects, harming individuals, or damaging companies, and 2) The Data Descriptor paper of an earlier version of the dataset (citation below) needs to be cited when using the data for research and/or commercial purposes. Downloading this dataset implies agreement with the above two conditions.
This data contains movement information generated from user location data obtained from LY Corporation smartphone applications. It does not reveal the actual timestamp, latitude, longitude, etc., and does not identify individuals. This data can only be used for the purpose of participating in the Humob Challenge 2024.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Direct observations of the oceans acquired on oceanographic research ships operated across the international community support fundamental research into the many disciplines of ocean science and provide essential information for monitoring the health of the oceans. A comprehensive knowledge base is needed to support the responsible stewardship of the oceans with easy access to all data acquired globally. In the United States, the multidisciplinary shipboard sensor data routinely acquired each year on the fleet of coastal, regional and global ranging vessels supporting academic marine research are managed by the Rolling Deck to Repository (R2R, rvdata.us) program. With over a decade of operations, the R2R program has developed a robust routinized system to transform diverse data contributions from different marine data providers into a standardized and comprehensive collection of global-ranging observations of marine atmosphere, ocean, seafloor and subseafloor properties that is openly available to the international research community. In this article we describe the elements and framework of the R2R program and the services provided. To manage all expeditions conducted annually, a fleet-wide approach has been developed using data distributions submitted from marine operators with a data management workflow designed to maximize automation of data curation. Other design goals are to improve the completeness and consistency of the data and metadata archived, to support data citability, provenance tracking and interoperable data access aligned with FAIR (findable, accessible, interoperable, reusable) recommendations, and to facilitate delivery of data from the fleet for global data syntheses. Findings from a collection-level review of changes in data acquisition practices and quality over the past decade are presented. Lessons learned from R2R operations are also discussed including the benefits of designing data curation around the routine practices of data providers, approaches for ensuring preservation of a more complete data collection with a high level of FAIRness, and the opportunities for homogenization of datasets from the fleet so that they can support the broadest re-use of data across a diverse user community.
https://crawlfeeds.com/privacy_policy
Unlock the full potential of your data-driven projects with our comprehensive Grainger products dataset. This meticulously curated dataset includes detailed information on a wide range of products available on Grainger, one of the leading industrial supply companies.
This dataset is perfect for eCommerce platforms, market analysis, competitive analysis, product comparison, and more. Leverage the power of high-quality, structured data to enhance your business strategies and decision-making processes.
Versions:
The latest version of the Grainger dataset contains 1.2 million records and was last extracted in January 2025.
Reach out to contact@crawlfeeds.com
Use Cases:
Explore the vast collection of Grainger products and elevate your business insights with this high-quality dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains match score data from major international competitions across 12 team ball sports: basketball, cricket, field hockey, futsal, handball, ice hockey, lacrosse, roller hockey, rugby, soccer, volleyball, and water polo. The dataset was obtained by web scraping data available on Wikipedia pages and includes, for each sport, the following information related to individual matches: the year of the competition edition when a match occurred, the names of the two opposing teams, their respective scores, and the name of the winning team.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset originates from the research domain of Customer Churn Prediction in the Telecom Industry. It was created as part of the project "Data-Driven Churn Prediction: ML Solutions for the Telecom Industry," completed within the Data Stewardship course (Master programme Data Science, TU Wien).
The primary purpose of this dataset is to support machine learning model development for predicting customer churn based on customer demographics, service usage, and account information.
The dataset enables the training, testing, and evaluation of classification algorithms, allowing researchers and practitioners to explore techniques for customer retention optimization.
The dataset was originally obtained from the IBM Accelerator Catalog and adapted for academic use. It was uploaded to TU Wien’s DBRepo test system and accessed via SQLAlchemy connections to the MariaDB environment.
The dataset has a tabular structure and was initially stored in CSV format. It contains:
Rows: 7,043 customer records
Columns: 21 features including customer attributes (gender, senior citizen status, partner status), account information (tenure, contract type, payment method), service usage (internet service, streaming TV, tech support), and the target variable (Churn: Yes/No).
Naming Convention:
The table in the database is named telco_customer_churn_data.
Software Requirements:
To open and work with the dataset, any standard database client or programming language supporting MariaDB connections can be used (e.g., Python). For machine learning applications, libraries such as pandas, scikit-learn, and joblib are typically used.
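A minimal access sketch, assuming a reachable MariaDB instance; the host, user, and database names below are placeholders, with real credentials coming from your DBRepo authorization:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder DSN for the DBRepo-hosted MariaDB instance.
engine = create_engine("mysql+pymysql://user:password@dbrepo-host:3306/churn_db")

df = pd.read_sql("SELECT * FROM telco_customer_churn_data", engine)
print(df.shape)                    # expected: (7043, 21)
print(df["Churn"].value_counts())  # target variable: Yes/No
```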
Additional Resources:
Source code for data loading, preprocessing, model training, and evaluation is available at the associated GitHub repository: https://github.com/nazerum/fair-ml-customer-churn
When reusing the dataset, users should be aware:
Licensing: The dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Use Case Suitability: The dataset is best suited for classification tasks, particularly binary classification (churn vs. no churn).
Metadata Standards: Metadata describing the dataset adheres to FAIR principles and is supplemented by CodeMeta and Croissant standards for improved interoperability.
This dataset contains field boundaries for smallholder farms in eastern Rwanda. The NASA Harvest program funded a team of annotators from TaQadam to label Planet imagery for the 2021 growing season for the purpose of conducting the Rwanda Field Boundary Detection Challenge. The dataset includes rasterized labeled field boundaries and time-series satellite imagery from Planet's NICFI program. Planet's basemap imagery is provided for six months (March, April, August, October, November, and December). The paired dataset is provided in 256x256 chips, for a total of 70 tiles covering 1,532 individual fields.
Input imagery consists of a time series of Planet Basemaps (monthly composites) from the NICFI program.
Imagery Copyright 2021 Planet Labs Inc. All use subject to the Participant License Agreement.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
a) Fully labeled images (i.e. the image has all the trees delineated and each polygon has species information)
b) Partially labeled images (i.e. the image has only some trees delineated, and each polygon has species information)
| No. | Dataset name | Training images | Validation images | Fully labeled | Partially labeled |
| --- | --- | --- | --- | --- | --- |
| 1 | 12_RGB5cm_FullyLabeled | 1066 | 304 | x | |
| 2 | ObjectDetection_TreeSpecies | 422 | 84 | x | |
| 3 | 34_RGB_all_L_PascalVoc_640Mask | 951 | 272 | x | |
| 4 | 34_RGB_PartiallyLabeled640 | 917 | 262 | | x |
Data presented here are a subset of a larger plankton imagery data set collected in the subtropical Straits of Florida from 2014-05-28 to 2014-06-14. Imagery data were collected using the In Situ Ichthyoplankton Imaging System (ISIIS-2) as part of a NSF-funded project to assess the biophysical drivers affecting fine-scale interactions between larval fish, their prey, and predators. This subset of images was used in the inaugural National Data Science Bowl (www.datasciencebowl.com) hosted by Kaggle and sponsored by Booz Allen Hamilton. Data were originally collected to examine the biophysical drivers affecting fine-scale (spatial) interactions between larval fish, their prey, and predators in a subtropical pelagic marine ecosystem. Image segments extracted from the raw data were sorted into 121 plankton classes, split 50:50 into train and test data sets, and provided for a machine learning competition (the National Data Science Bowl). There were no hierarchical relationships explicit in the 121 plankton classes, though the class naming convention and a tree-like diagram (see file "Plankton Relationships.pdf") indicated relationships between classes, whether taxonomic or structural (size and shape). We intend for this dataset to be available to the machine learning and computer vision community as a standard machine learning benchmark. This "Plankton 1.0" dataset is a medium-size dataset with a fair amount of complexity where image classification improvements can still be made.
This is just a copy of https://www.kaggle.com/c/kaggle-survey-2020/data.
It is hosted as a public Kaggle dataset instead of a Kaggle competition dataset to make it easier for users to find via search (public datasets and competition datasets have separate search menus, and knowing which one to use can be confusing, so hopefully this makes the data easier to find).
You are probably looking for https://www.kaggle.com/c/kaggle-survey-2020/data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The FAIR-CSAR-V1.0 dataset, constructed from single-look complex (SLC) images of the Gaofen-3 satellite, is the largest and most finely annotated SAR image dataset for fine-grained targets to date. FAIR-CSAR-V1.0 aims to advance related technologies in SAR image object detection, recognition, and target characteristic understanding. The dataset is developed by the Key Laboratory of Target Cognition and Application Technology (TCAT) at the Aerospace Information Research Institute, Chinese Academy of Sciences.
FAIR-CSAR-V1.0 comprises 175 scenes of Gaofen-3 Level-1 SLC products, covering 32 global regions including airports, oil refineries, ports, and rivers. With a total data volume of 250 GB and over 340,000 instances, FAIR-CSAR-V1.0 covers 5 main categories and 22 subcategories, providing detailed annotations for imaging parameters (e.g., radar center frequency, pulse repetition frequency) and target characteristics (e.g., satellite-ground relative azimuthal angle, key scattering point distribution).
FAIR-CSAR-V1.0 consists of two sub-datasets: the SL dataset and the FSI dataset. The SL dataset, acquired in spotlight mode with a nominal resolution of 1 meter, contains 170,000 instances across 22 target classes. The FSI dataset, acquired in fine stripmap mode with a nominal resolution of 5 meters, includes 170,000 instances across 3 target classes. Figure 1 presents an overview of the dataset.
Data paper and citation format:
[1] Youming Wu, Wenhui Diao, Yuxi Suo, Xian Sun. A Benchmark Dataset for Fine-Grained Object Detection and Recognition Based on Single-Look Complex SAR Images (FAIR-CSAR-V1.0) [OL]. Journal of Radars, 2025. https://radars.ac.cn/web/data/getData?dataType=FAIR_CSAR_en&pageType=en.
[2] Y. Wu, Y. Suo, Q. Meng, W. Dai, T. Miao, W. Zhao, Z. Yan, W. Diao, G. Xie, Q. Ke, Y. Zhao, K. Fu and X. Sun. FAIR-CSAR: A Benchmark Dataset for Fine-Grained Object Detection and Recognition Based on Single-Look Complex SAR Images [J]. IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-22, 2025, doi: 10.1109/TGRS.2024.3519891.