https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Budi Ryan
Released under CC0: Public Domain
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A dataset for the first task, "Explain or teach basic data science concepts," of the Google – AI Assistants for Data Tasks with Gemma competition.
This dataset contains several data science glossaries, where every sample contains two keys: term (the vocabulary name) and definition.
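For illustration, a record under this two-key structure might look like the following sketch (the term and definition shown are hypothetical, not taken from the dataset):

```python
# Hypothetical sample record following the term/definition schema described above.
sample = {
    "term": "overfitting",
    "definition": "When a model fits the training data too closely and generalizes poorly to new data.",
}
print(sample["term"], "-", sample["definition"])
```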
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for the LLM Science Exam Kaggle Competition
Dataset Summary
https://www.kaggle.com/competitions/kaggle-llm-science-exam/data
Languages
[en, de, tl, it, es, fr, pt, id, pl, ro, so, ca, da, sw, hu, no, nl, et, af, hr, lv, sl]
Dataset Structure
Columns:
prompt - the text of the question being asked
A - option A; if this option is correct, then answer will be A
B - option B; if this option is correct, then answer will be B
C - option C; if this… See the full description on the dataset page: https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam.
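A minimal loading sketch, assuming the Hugging Face datasets library and that the data is exposed under a train split (the split name is an assumption):

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub.
ds = load_dataset("Sangeetha/Kaggle-LLM-Science-Exam")

# Inspect the first question and a few of its options ("train" split assumed).
row = ds["train"][0]
print(row["prompt"], row["A"], row["B"], row["C"])
```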
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains the raw source code from hundreds of thousands of versions of public, Apache 2.0 licensed Python and R notebooks on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
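As a sketch of the layout just described, the two directory components can be derived from a KernelVersions id with integer division (illustrative only; file extensions depend on the notebook language):

```python
# Map a KernelVersions id to its two-level folder, per the layout above.
def kernel_version_dir(version_id: int) -> str:
    top = version_id // 1_000_000        # millions group, e.g. 123456789 -> 123
    sub = (version_id // 1_000) % 1_000  # thousands group, e.g. 123456789 -> 456
    return f"{top}/{sub}"

assert kernel_version_dir(123_456_789) == "123/456"
```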
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket, meaning you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
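A minimal download sketch, assuming the google-cloud-storage Python client, a billing-enabled project, and a hypothetical object path (actual paths follow the directory layout above):

```python
from google.cloud import storage

# user_project identifies the project billed for this requester-pays download.
client = storage.Client(project="your-gcp-project")  # assumed project id
bucket = client.bucket("kaggle-meta-kaggle-code-downloads",
                       user_project="your-gcp-project")
blob = bucket.blob("123/456/123456789.ipynb")        # hypothetical object path
blob.download_to_filename("123456789.ipynb")
```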
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
This dataset contains crop types of agricultural fields in four states of northern India: Uttar Pradesh, Rajasthan, Odisha, and Bihar. There are 13 classes in the dataset: Fallow land and 12 crop types (Wheat, Mustard, Lentil, Green pea, Sugarcane, Garlic, Maize, Gram, Coriander, Potato, Bersem, and Rice). The dataset is split into train and test collections as part of the AgriFieldNet India Competition. Ground reference data for this dataset was collected by IDinsight's Data on Demand team. Radiant Earth Foundation carried out the training dataset curation and publication. This training dataset was generated through a grant from the Enabling Crop Analytics at Scale (ECAAS) Initiative, funded by The Bill & Melinda Gates Foundation and implemented by Tetra Tech.
https://creativecommons.org/publicdomain/zero/1.0/
Original data from Predict Future Sales (Kaggle competition). Translated item_categories.csv, shops.csv, and items.csv from Russian to English for easier feature engineering and reference.
Translated item descriptions and shop names from Russian to English. items.csv - supplemental information about the items/products. item_categories.csv - supplemental information about the item categories. shops.csv - supplemental information about the shops.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compressed (.zip) archive containing the data set in .csv format and a README.txt file explaining the columns. (ZIP)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain:
The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.
Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.
How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.
Dataset Structure:
The dataset consists of three main files, each with its specific role:
Train:
This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).
https://handle.test.datacite.org/10.82556/yb6j-jw41
PID: b1c59499-9c6e-42c2-af8f-840181e809db
Test2:
The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models and for evaluating prediction quality when the true sales data is unknown.
https://handle.test.datacite.org/10.82556/jerg-4b84
PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
Store:
This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.
https://handle.test.datacite.org/10.82556/nqeg-gy34
PID: 9627ec46-4ee6-4969-b14a-bda555fe34db
Column Descriptions:
Id: A unique identifier for each (Store, Date) combination within the test set.
Store: A unique identifier for each store.
Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).
Customers: The number of customers visiting the store on a given day.
Open: An indicator of whether the store was open (1 = open, 0 = closed).
StateHoliday: Indicates if the day is a state holiday, with values like:
'a' = public holiday,
'b' = Easter holiday,
'c' = Christmas,
'0' = no holiday.
SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).
StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.
Assortment: Describes the level of product assortment in the store:
'a' = basic,
'b' = extra,
'c' = extended.
CompetitionDistance: Distance (in meters) to the nearest competitor store.
CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.
Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).
Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).
Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.
PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
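A minimal loading sketch based on the columns above, assuming the files are named train.csv and store.csv and include a Date column (as implied by the Id description):

```python
import pandas as pd

# StateHoliday mixes '0' and letters, so read it as a string throughout.
train = pd.read_csv("train.csv", parse_dates=["Date"], dtype={"StateHoliday": str})
store = pd.read_csv("store.csv")

# Attach per-store metadata (StoreType, Assortment, competition fields) to each day.
df = train.merge(store, on="Store", how="left")

# Keep only days the store was open, and flag state holidays.
df = df[df["Open"] == 1].copy()
df["IsStateHoliday"] = (df["StateHoliday"] != "0").astype(int)
```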
To work with this dataset, you will need to have specific software installed, including:
DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.
Python Libraries: Key libraries for working with the dataset include pandas for data manipulation, numpy for numerical operations, matplotlib and seaborn for data visualization, and scikit-learn for machine learning algorithms.
Several additional resources are available for working with the dataset:
Presentation:
A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.
Jupyter Notebook:
A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.
Model Evaluation Results:
The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.
Trained Models (.pkl files):
The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression) that can be loaded and used to make predictions without retraining the models from scratch (see the sketch after this resource list).
sample_submission.csv:
This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.
These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
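A hypothetical end-to-end sketch tying these resources together; the file name random_forest_model.pkl and the feature-preparation step are assumptions, since the exact naming and preprocessing live in the notebook:

```python
import joblib
import pandas as pd

test = pd.read_csv("test.csv", parse_dates=["Date"], dtype={"StateHoliday": str})
model = joblib.load("random_forest_model.pkl")  # assumed file name

# Assumption: features must be prepared exactly as in the training notebook;
# here we naively drop the identifier and date columns for illustration.
X_test = test.drop(columns=["Id", "Date"])

submission = pd.DataFrame({"Id": test["Id"], "Sales": model.predict(X_test)})
submission.to_csv("sample_submission.csv", index=False)
```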
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source: [9].
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A cleaned version of Competitions.csv focused on timeline analysis.
✅ Includes: CompetitionId, Title, Deadline, EnabledDate, HostSegmentTitle
✅ Helps understand growth over time and regional hosting focus
✅ Can be joined with teams_clean.csv and user_achievements_clean.csv
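A minimal join sketch, assuming teams_clean.csv also carries a CompetitionId key (the cleaned file name and the join key are assumptions):

```python
import pandas as pd

competitions = pd.read_csv("competitions_clean.csv",  # assumed file name
                           parse_dates=["Deadline", "EnabledDate"])
teams = pd.read_csv("teams_clean.csv")

# Join team records onto their competitions.
merged = competitions.merge(teams, on="CompetitionId", how="left")
print(merged.shape)

# Growth over time by host segment, from the competition-level table.
competitions["Year"] = competitions["EnabledDate"].dt.year
print(competitions.groupby(["Year", "HostSegmentTitle"]).size().head())
```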
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This multi-city human mobility dataset contains data from 4 metropolitan areas (cities A, B, C, D) somewhere in Japan. Each city is divided into 500 meter x 500 meter cells spanning a 200 x 200 grid. The datasets capture the movement of individuals across a 75-day period, discretized into 30-minute intervals and 500-meter grid cells. The four cities contain the movement data of 100,000, 25,000, 20,000, and 6,000 individuals, respectively.
While the name or location of the city is not disclosed, the participants are provided with points-of-interest (POIs; e.g., restaurants, parks) data for each grid cell (~85 dimensional vector) for the four cities as supplementary information (e.g., POIdata_cityA). The list of 85 POI categories can be found in POI_datacategories.csv.
This dataset was used for the HuMob Data Challenge 2024 competition. For more details, see https://wp.nyu.edu/humobchallenge2024/
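A small sketch of the discretization described above; the mapping from raw coordinates to cells is an assumption for illustration, since the released data is already discretized:

```python
# 500 m x 500 m cells on a 200 x 200 grid; 30-minute slots (48 per day, 75 days).
def to_cell(x_meters: float, y_meters: float) -> tuple[int, int]:
    return int(x_meters // 500), int(y_meters // 500)  # indices in [0, 200)

def to_timeslot(day: int, minutes_into_day: int) -> int:
    return day * 48 + minutes_into_day // 30           # slots in [0, 75 * 48)
```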
Researchers may use this dataset for publications and reports, as long as: 1) Users shall not carry out activities that involve unethical usage of the data, including attempts at re-identifying data subjects, harming individuals, or damaging companies, and 2) The Data Descriptor paper of an earlier version of the dataset (citation below) needs to be cited when using the data for research and/or commercial purposes. Downloading this dataset implies agreement with the above two conditions.
This data contains movement information generated from user location data obtained from LY Corporation smartphone applications. It does not reveal the actual timestamp, latitude, longitude, etc., and does not identify individuals. This data can only be used for the purpose of participating in the Humob Challenge 2024.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Direct observations of the oceans acquired on oceanographic research ships operated across the international community support fundamental research into the many disciplines of ocean science and provide essential information for monitoring the health of the oceans. A comprehensive knowledge base is needed to support the responsible stewardship of the oceans with easy access to all data acquired globally. In the United States, the multidisciplinary shipboard sensor data routinely acquired each year on the fleet of coastal, regional and global ranging vessels supporting academic marine research are managed by the Rolling Deck to Repository (R2R, rvdata.us) program. With over a decade of operations, the R2R program has developed a robust routinized system to transform diverse data contributions from different marine data providers into a standardized and comprehensive collection of global-ranging observations of marine atmosphere, ocean, seafloor and subseafloor properties that is openly available to the international research community. In this article we describe the elements and framework of the R2R program and the services provided. To manage all expeditions conducted annually, a fleet-wide approach has been developed using data distributions submitted from marine operators with a data management workflow designed to maximize automation of data curation. Other design goals are to improve the completeness and consistency of the data and metadata archived, to support data citability, provenance tracking and interoperable data access aligned with FAIR (findable, accessible, interoperable, reusable) recommendations, and to facilitate delivery of data from the fleet for global data syntheses. Findings from a collection-level review of changes in data acquisition practices and quality over the past decade are presented. Lessons learned from R2R operations are also discussed including the benefits of designing data curation around the routine practices of data providers, approaches for ensuring preservation of a more complete data collection with a high level of FAIRness, and the opportunities for homogenization of datasets from the fleet so that they can support the broadest re-use of data across a diverse user community.
https://crawlfeeds.com/privacy_policy
Unlock the full potential of your data-driven projects with our comprehensive Grainger products dataset. This meticulously curated dataset includes detailed information on a wide range of products available on Grainger, one of the leading industrial supply companies.
This dataset is perfect for eCommerce platforms, market analysis, competitive analysis, product comparison, and more. Leverage the power of high-quality, structured data to enhance your business strategies and decision-making processes.
Versions:
The latest version of the Grainger dataset contains 1.2 million records and was last extracted in January 2025.
Reach out to contact@crawlfeeds.com
Use Cases:
Explore the vast collection of Grainger products and elevate your business insights with this high-quality dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains match score data from major international competitions across 12 team ball sports: basketball, cricket, field hockey, futsal, handball, ice hockey, lacrosse, roller hockey, rugby, soccer, volleyball, and water polo. The dataset was obtained by web scraping data available on Wikipedia pages and includes, for each sport, the following information related to individual matches: the year of the competition edition when a match occurred, the names of the two opposing teams, their respective scores, and the name of the winning team.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset originates from the research domain of Customer Churn Prediction in the Telecom Industry. It was created as part of the project "Data-Driven Churn Prediction: ML Solutions for the Telecom Industry," completed within the Data Stewardship course (Master programme Data Science, TU Wien).
The primary purpose of this dataset is to support machine learning model development for predicting customer churn based on customer demographics, service usage, and account information.
The dataset enables the training, testing, and evaluation of classification algorithms, allowing researchers and practitioners to explore techniques for customer retention optimization.
The dataset was originally obtained from the IBM Accelerator Catalog and adapted for academic use. It was uploaded to TU Wien’s DBRepo test system and accessed via SQLAlchemy connections to the MariaDB environment.
The dataset has a tabular structure and was initially stored in CSV format. It contains:
Rows: 7,043 customer records
Columns: 21 features including customer attributes (gender, senior citizen status, partner status), account information (tenure, contract type, payment method), service usage (internet service, streaming TV, tech support), and the target variable (Churn: Yes/No).
Naming Convention:
The table in the database is named telco_customer_churn_data.
Software Requirements:
To open and work with the dataset, any standard database client or programming language supporting MariaDB connections can be used (e.g., Python). For machine learning applications, libraries such as pandas, scikit-learn, and joblib are typically used.
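A minimal access sketch, assuming a reachable MariaDB instance; the host, user, and database names below are placeholders, with real credentials coming from your DBRepo authorization:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder DSN for the DBRepo-hosted MariaDB instance.
engine = create_engine("mysql+pymysql://user:password@dbrepo-host:3306/churn_db")

df = pd.read_sql("SELECT * FROM telco_customer_churn_data", engine)
print(df.shape)                    # expected: (7043, 21)
print(df["Churn"].value_counts())  # target variable: Yes/No
```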
Additional Resources:
Source code for data loading, preprocessing, model training, and evaluation is available at the associated GitHub repository: https://github.com/nazerum/fair-ml-customer-churn
When reusing the dataset, users should be aware:
Licensing: The dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Use Case Suitability: The dataset is best suited for classification tasks, particularly binary classification (churn vs. no churn).
Metadata Standards: Metadata describing the dataset adheres to FAIR principles and is supplemented by CodeMeta and Croissant standards for improved interoperability.
This dataset contains field boundaries for smallholder farms in eastern Rwanda. The NASA Harvest program funded a team of annotators from TaQadam to label Planet imagery for the 2021 growing season for the purpose of conducting the Rwanda Field Boundary Detection Challenge. The dataset includes rasterized labeled field boundaries and time-series satellite imagery from Planet's NICFI program. Planet's basemap imagery is provided for six months (March, April, August, October, November, and December). The paired dataset is provided in 256x256 chips, for a total of 70 tiles covering 1,532 individual fields.
Input imagery consists of a time series of Planet Basemaps (monthly composites) from the NICFI program.
Imagery Copyright 2021 Planet Labs Inc. All use subject to the Participant License Agreement.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
a) Fully labeled images (i.e. the image has all the trees delineated and each polygon has species information)
b) Partially labeled images (i.e. the image has only some trees delineated, and each polygon has species information)
| No. | Dataset name | Training images | Validation images | Fully labeled | Partially labeled |
| --- | --- | --- | --- | --- | --- |
| 1 | 12_RGB5cm_FullyLabeled | 1066 | 304 | x | |
| 2 | ObjectDetection_TreeSpecies | 422 | 84 | x | |
| 3 | 34_RGB_all_L_PascalVoc_640Mask | 951 | 272 | x | |
| 4 | 34_RGB_PartiallyLabeled640 | 917 | 262 | | x |
Data presented here are a subset of a larger plankton imagery data set collected in the subtropical Straits of Florida from 2014-05-28 to 2014-06-14. Imagery data were collected using the In Situ Ichthyoplankton Imaging System (ISIIS-2) as part of a NSF-funded project to assess the biophysical drivers affecting fine-scale interactions between larval fish, their prey, and predators. This subset of images was used in the inaugural National Data Science Bowl (www.datasciencebowl.com) hosted by Kaggle and sponsored by Booz Allen Hamilton. Data were originally collected to examine the biophysical drivers affecting fine-scale (spatial) interactions between larval fish, their prey, and predators in a subtropical pelagic marine ecosystem. Image segments extracted from the raw data were sorted into 121 plankton classes, split 50:50 into train and test data sets, and provided for a machine learning competition (the National Data Science Bowl). There were no hierarchical relationships explicit in the 121 plankton classes, though the class naming convention and a tree-like diagram (see file "Plankton Relationships.pdf") indicated relationships between classes, whether taxonomic or structural (size and shape). We intend for this dataset to be available to the machine learning and computer vision community as a standard machine learning benchmark. This "Plankton 1.0" dataset is a medium-size dataset with a fair amount of complexity where image classification improvements can still be made.
This is just a copy of https://www.kaggle.com/c/kaggle-survey-2020/data.
It is hosted as a public Kaggle dataset instead of a Kaggle competition dataset to make it easier for users to find via search (public datasets and competition datasets have separate search menus, and knowing which one to use can be confusing, so hopefully this makes the data easier to find).
You are probably looking for https://www.kaggle.com/c/kaggle-survey-2020/data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The FAIR-CSAR-V1.0 dataset, constructed from single-look complex (SLC) images of the Gaofen-3 satellite, is the largest and most finely annotated SAR image dataset for fine-grained targets to date. FAIR-CSAR-V1.0 aims to advance related technologies in SAR image object detection, recognition, and target characteristic understanding. The dataset is developed by the Key Laboratory of Target Cognition and Application Technology (TCAT) at the Aerospace Information Research Institute, Chinese Academy of Sciences.
FAIR-CSAR-V1.0 comprises 175 scenes of Gaofen-3 Level-1 SLC products, covering 32 global regions including airports, oil refineries, ports, and rivers. With a total data volume of 250 GB and over 340,000 instances, FAIR-CSAR-V1.0 covers 5 main categories and 22 subcategories, providing detailed annotations for imaging parameters (e.g., radar center frequency, pulse repetition frequency) and target characteristics (e.g., satellite-ground relative azimuthal angle, key scattering point distribution).
FAIR-CSAR-V1.0 consists of two sub-datasets: the SL dataset and the FSI dataset. The SL dataset, acquired in spotlight mode with a nominal resolution of 1 meter, contains 170,000 instances across 22 target classes. The FSI dataset, acquired in fine stripmap mode with a nominal resolution of 5 meters, includes 170,000 instances across 3 target classes. Figure 1 presents an overview of the dataset.
Data paper and citation format:
[1] Youming Wu, Wenhui Diao, Yuxi Suo, Xian Sun. A Benchmark Dataset for Fine-Grained Object Detection and Recognition Based on Single-Look Complex SAR Images (FAIR-CSAR-V1.0) [OL]. Journal of Radars, 2025. https://radars.ac.cn/web/data/getData?dataType=FAIR_CSAR_en&pageType=en.
[2] Y. Wu, Y. Suo, Q. Meng, W. Dai, T. Miao, W. Zhao, Z. Yan, W. Diao, G. Xie, Q. Ke, Y. Zhao, K. Fu and X. Sun. FAIR-CSAR: A Benchmark Dataset for Fine-Grained Object Detection and Recognition Based on Single-Look Complex SAR Images [J]. IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-22, 2025, doi: 10.1109/TGRS.2024.3519891.