100+ datasets found
  1. How to Win Data Science Competition

    • kaggle.com
    zip
    Updated Jan 30, 2018
    Cite
    Budi Ryan (2018). How to Win Data Science Competition [Dataset]. https://www.kaggle.com/budiryan/how-to-win-data-science-competition
    Explore at:
    Available download formats: zip (15,845,091 bytes)
    Dataset updated
    Jan 30, 2018
    Authors
    Budi Ryan
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Budi Ryan

    Released under CC0: Public Domain


  2. Data Science Glossary For QA

    • kaggle.com
    Updated Mar 8, 2024
    Cite
    Sofianesun (2024). Data Science Glossary For QA [Dataset]. https://www.kaggle.com/datasets/sofianesun/data-science-glossary-for-qa
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sofianesun
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    A dataset for the first task ("Explain or teach basic data science concepts") of the Google – AI Assistants for Data Tasks with Gemma competition. It contains several data science glossaries, where every sample has two keys: term (the vocabulary name) and definition.

  3. Kaggle-LLM-Science-Exam

    • huggingface.co
    Updated Aug 8, 2023
    Cite
    Sangeetha Venkatesan (2023). Kaggle-LLM-Science-Exam [Dataset]. https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 8, 2023
    Authors
    Sangeetha Venkatesan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for [LLM Science Exam Kaggle Competition]

      Dataset Summary
    

    https://www.kaggle.com/competitions/kaggle-llm-science-exam/data

      Languages
    

    [en, de, tl, it, es, fr, pt, id, pl, ro, so, ca, da, sw, hu, no, nl, et, af, hr, lv, sl]

      Dataset Structure
    

    Columns:

    • prompt - the text of the question being asked
    • A - option A; if this option is correct, then answer will be A
    • B - option B; if this option is correct, then answer will be B
    • C - option C; if this… See the full description on the dataset page: https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam.

  4. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Aug 28, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Available download formats: zip (154,608,704,973 bytes)
    Dataset updated
    Aug 28, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of versions of public, Apache 2.0 licensed Python and R notebooks on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
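As a sketch of that join, the file names can be parsed into KernelVersion ids and matched against Meta Kaggle. The tiny frame below is an invented stand-in for the KernelVersions csv, and the column names (Id, TotalVotes) are assumptions to verify against the actual file:

```python
import pandas as pd

# Stand-in for Meta Kaggle's KernelVersions csv; in practice use
# pd.read_csv("KernelVersions.csv"). Column names are assumptions.
kernel_versions = pd.DataFrame({"Id": [123456001, 123456002],
                                "TotalVotes": [10, 3]})

# File names in Meta Kaggle Code match the ids in KernelVersions,
# so stripping the extension recovers the join key.
file_names = ["123456001.py"]
code_ids = pd.DataFrame({"Id": [int(f.split(".")[0]) for f in file_names]})

joined = code_ids.merge(kernel_versions, on="Id", how="left")
```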

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
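Under that layout, a version id maps to its folder pair by integer division. A minimal sketch (folder-name padding and the file extension are assumptions to check against the actual listing):

```python
def version_path(version_id: int) -> str:
    # Top-level folder groups ids by millions (123 -> 123,000,000..123,999,999);
    # sub-folder groups by thousands (123/456 -> 123,456,000..123,456,999).
    top = version_id // 1_000_000
    sub = (version_id // 1_000) % 1_000
    return f"{top}/{sub}/{version_id}"
```

For example, version 123,456,789 would live under 123/456/.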

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  5. AgriFieldNet Competition Dataset

    • cmr.earthdata.nasa.gov
    Updated Oct 10, 2023
    Cite
    (2023). AgriFieldNet Competition Dataset [Dataset]. http://doi.org/10.34911/rdnt.wu92p1
    Explore at:
    Dataset updated
    Oct 10, 2023
    Time period covered
    Jan 1, 2020 - Jan 1, 2023
    Description

    This dataset contains crop types of agricultural fields in four states of northern India: Uttar Pradesh, Rajasthan, Odisha, and Bihar. There are 13 classes in the dataset: Fallow land and 12 crop types (Wheat, Mustard, Lentil, Green pea, Sugarcane, Garlic, Maize, Gram, Coriander, Potato, Bersem, and Rice). The dataset is split into train and test collections as part of the AgriFieldNet India Competition. Ground reference data for this dataset was collected by IDinsight’s Data on Demand team. Radiant Earth Foundation carried out the training dataset curation and publication. This training dataset was generated through a grant from the Enabling Crop Analytics at Scale (ECAAS) Initiative, funded by The Bill & Melinda Gates Foundation and implemented by Tetra Tech.

  6. Predict Future Sales (translated to English)

    • kaggle.com
    Updated Nov 24, 2020
    Cite
    YWenLin (2020). Predict Future Sales (translated to English) [Dataset]. https://www.kaggle.com/datasets/ywhenlyn/predict-future-sales-translated-to-english/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 24, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    YWenLin
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Original data from Predict Future Sales (Kaggle competition). item_categories.csv, shops.csv, and items.csv were translated from Russian to English for easier feature engineering and reference.

    File Information

    Translated item descriptions and shop names from Russian to English:

    • items.csv - supplemental information about the items/products.
    • item_categories.csv - supplemental information about the item categories.
    • shops.csv - supplemental information about the shops.

    Column Description

    • ID - an Id that represents a (Shop, Item) tuple within the test set
    • shop_id - unique identifier of a shop
    • item_id - unique identifier of a product
    • item_name - name of item
    • shop_name - name of shop
    • item_category_name - name of item category
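A minimal sketch of how the translated lookup files attach to sales records. The tiny frames below are invented stand-ins for items.csv, shops.csv, and the original competition's sales_train.csv; in practice read them with pd.read_csv:

```python
import pandas as pd

# Invented stand-ins for the real CSVs.
items = pd.DataFrame({"item_id": [0], "item_name": ["USB cable"]})
shops = pd.DataFrame({"shop_id": [5], "shop_name": ["Moscow TC"]})
sales = pd.DataFrame({"shop_id": [5], "item_id": [0], "item_cnt_day": [3.0]})

# Left-join the English names onto each sales row via the shared ids.
enriched = (sales.merge(items, on="item_id", how="left")
                 .merge(shops, on="shop_id", how="left"))
```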
  7. Dataset of all user solutions and actions in the experiment.

    • plos.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Devon Brackbill; Damon Centola (2023). Dataset of all user solutions and actions in the experiment. [Dataset]. http://doi.org/10.1371/journal.pone.0237978.s012
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Devon Brackbill; Damon Centola
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compressed (.zip) archive containing the data set in .csv format and a README.txt file explaining the columns. (ZIP)

  8. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis

    • test.researchdata.tuwien.ac.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Explore at:
    Available download formats: csv, text/markdown, json, bin
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
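The shared Store identifier is what ties the Train and Store files together. A minimal sketch, with tiny invented frames standing in for the real files:

```python
import pandas as pd

# Invented stand-ins for the Train and Store files described above.
train = pd.DataFrame({"Store": [1], "Date": ["2015-07-31"],
                      "Sales": [5263], "Open": [1], "Promo": [1]})
store = pd.DataFrame({"Store": [1], "StoreType": ["c"], "Assortment": ["a"],
                      "CompetitionDistance": [1270.0]})

# Each daily sales row picks up its store's metadata via the Store key.
df = train.merge(store, on="Store", how="left")
```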

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.

  9. Data characteristics for the Kaggle.com seizure forecasting contest.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Francisco Javier Muñoz-Almaraz; Francisco Zamora-Martínez; Paloma Botella-Rocamora; Juan Pardo (2023). Data characteristics for the Kaggle.com seizure forecasting contest. [Dataset]. http://doi.org/10.1371/journal.pone.0178808.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Francisco Javier Muñoz-Almaraz; Francisco Zamora-Martínez; Paloma Botella-Rocamora; Juan Pardo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source: [9].

  10. Meta_Kaggle_Competitions_cleaned_dataset

    • kaggle.com
    Updated Jul 17, 2025
    Cite
    Sarvpreet Kaur (2025). Meta_Kaggle_Competitions_cleaned_dataset [Dataset]. https://www.kaggle.com/datasets/sarvpreetkaur22/meta-kaggle-competitions-cleaned-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sarvpreet Kaur
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📝 Description:

    A cleaned version of Competitions.csv focused on timeline analysis.

    ✅ Includes: CompetitionId, Title, Deadline, EnabledDate, HostSegmentTitle
    ✅ Helps understand growth over time and regional hosting focus
    ✅ Can be joined with teams_clean.csv and user_achievements_clean.csv
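As a sketch of the timeline analysis these columns support, competitions can be counted per launch year from EnabledDate. The tiny frame below is invented stand-in data, not the real cleaned Competitions.csv:

```python
import pandas as pd

# Invented stand-in for the cleaned Competitions.csv.
comps = pd.DataFrame({
    "CompetitionId": [1, 2, 3],
    "Title": ["A", "B", "C"],
    "EnabledDate": ["2015-03-01", "2015-09-15", "2016-01-10"],
})

# Parse launch dates, then count competitions enabled per year.
comps["EnabledDate"] = pd.to_datetime(comps["EnabledDate"])
launches_per_year = comps.groupby(comps["EnabledDate"].dt.year).size()
```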

  11. LYMob-4Cities: Multi-City Human Mobility Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 17, 2025
    Cite
    Takahiro Yabe; Kota Tsubouchi; Toru Shimizu (2025). LYMob-4Cities: Multi-City Human Mobility Dataset [Dataset]. http://doi.org/10.5281/zenodo.14219563
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Takahiro Yabe; Kota Tsubouchi; Toru Shimizu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This multi-city human mobility dataset contains data from four undisclosed metropolitan areas in Japan (cities A, B, C, D). Each city is divided into 500 m x 500 m cells spanning a 200 x 200 grid. The human mobility datasets record the movement of individuals across a 75-day period, discretized into 30-minute intervals and 500-meter grid cells. The cities contain movement data for 100,000, 25,000, 20,000, and 6,000 individuals, respectively.
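The stated discretization (30-minute intervals over 75 days, 500 m cells on a 200 x 200 grid) implies simple index conversions. A hedged sketch, since the actual column names and file layout are not described above:

```python
def slot_to_minutes(day: int, slot: int) -> int:
    # 48 half-hour slots per day across the 75-day observation period;
    # the real-world start date is intentionally not disclosed.
    assert 0 <= day < 75 and 0 <= slot < 48
    return day * 24 * 60 + slot * 30

def cell_to_meters(ix: int, iy: int) -> tuple:
    # Offset in meters of a cell's corner within the 200 x 200 grid
    # of 500 m x 500 m cells.
    assert 0 <= ix < 200 and 0 <= iy < 200
    return ix * 500, iy * 500
```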

    While the name or location of the city is not disclosed, the participants are provided with points-of-interest (POIs; e.g., restaurants, parks) data for each grid cell (~85 dimensional vector) for the four cities as supplementary information (e.g., POIdata_cityA). The list of 85 POI categories can be found in POI_datacategories.csv.

    This dataset was used for the HuMob Data Challenge 2024 competition. For more details, see https://wp.nyu.edu/humobchallenge2024/

    Researchers may use this dataset for publications and reports, as long as: 1) Users shall not carry out activities that involve unethical usage of the data, including attempts at re-identifying data subjects, harming individuals, or damaging companies, and 2) The Data Descriptor paper of an earlier version of the dataset (citation below) needs to be cited when using the data for research and/or commercial purposes. Downloading this dataset implies agreement with the above two conditions.

    • Yabe, T., Tsubouchi, K., Shimizu, T., Sekimoto, Y., Sezaki, K., Moro, E., & Pentland, A. (2024). YJMob100K: City-scale and longitudinal dataset of anonymized human mobility trajectories. Scientific Data, 11(1), 397. https://www.nature.com/articles/s41597-024-03237-9

    This data contains movement information generated from user location data obtained from LY Corporation smartphone applications. It does not reveal the actual timestamp, latitude, longitude, etc., and does not identify individuals. This data can only be used for the purpose of participating in the Humob Challenge 2024.

  12. DataSheet_1_Rolling Deck to Repository: Supporting the marine science community with data management services from academic research expeditions

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 12, 2023
    Cite
    Suzanne M. Carbotte; Suzanne O’Hara; Karen Stocks; P. Dru Clark; Laura Stolp; Shawn R. Smith; Kristen Briggs; Rebecca Hudak; Emily Miller; Chris J. Olson; Neville Shane; Rafael Uribe; Robert Arko; Cynthia L. Chandler; Vicki Ferrini; Stephen P. Miller; Alice Doyle; James Holik (2023). DataSheet_1_Rolling Deck to Repository: Supporting the marine science community with data management services from academic research expeditions.docx [Dataset]. http://doi.org/10.3389/fmars.2022.1012756.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 12, 2023
    Dataset provided by
    Frontiers
    Authors
    Suzanne M. Carbotte; Suzanne O’Hara; Karen Stocks; P. Dru Clark; Laura Stolp; Shawn R. Smith; Kristen Briggs; Rebecca Hudak; Emily Miller; Chris J. Olson; Neville Shane; Rafael Uribe; Robert Arko; Cynthia L. Chandler; Vicki Ferrini; Stephen P. Miller; Alice Doyle; James Holik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Direct observations of the oceans acquired on oceanographic research ships operated across the international community support fundamental research into the many disciplines of ocean science and provide essential information for monitoring the health of the oceans. A comprehensive knowledge base is needed to support the responsible stewardship of the oceans with easy access to all data acquired globally. In the United States, the multidisciplinary shipboard sensor data routinely acquired each year on the fleet of coastal, regional and global ranging vessels supporting academic marine research are managed by the Rolling Deck to Repository (R2R, rvdata.us) program. With over a decade of operations, the R2R program has developed a robust routinized system to transform diverse data contributions from different marine data providers into a standardized and comprehensive collection of global-ranging observations of marine atmosphere, ocean, seafloor and subseafloor properties that is openly available to the international research community.

    In this article we describe the elements and framework of the R2R program and the services provided. To manage all expeditions conducted annually, a fleet-wide approach has been developed using data distributions submitted from marine operators with a data management workflow designed to maximize automation of data curation. Other design goals are to improve the completeness and consistency of the data and metadata archived, to support data citability, provenance tracking and interoperable data access aligned with FAIR (findable, accessible, interoperable, reusable) recommendations, and to facilitate delivery of data from the fleet for global data syntheses. Findings from a collection-level review of changes in data acquisition practices and quality over the past decade are presented. Lessons learned from R2R operations are also discussed, including the benefits of designing data curation around the routine practices of data providers, approaches for ensuring preservation of a more complete data collection with a high level of FAIRness, and the opportunities for homogenization of datasets from the fleet so that they can support the broadest re-use of data across a diverse user community.

  13. Grainger products dataset

    • crawlfeeds.com
    csv, zip
    Updated Mar 19, 2025
    Cite
    Crawl Feeds (2025). Grainger products dataset [Dataset]. https://crawlfeeds.com/datasets/grainger-products-dataset
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Mar 19, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Unlock the full potential of your data-driven projects with our comprehensive Grainger products dataset. This meticulously curated dataset includes detailed information on a wide range of products available on Grainger, one of the leading industrial supply companies.

    This dataset is perfect for eCommerce platforms, market analysis, competitive analysis, product comparison, and more. Leverage the power of high-quality, structured data to enhance your business strategies and decision-making processes.

    Versions:

    The latest version of the Grainger dataset contains 1.2 million records and was last extracted in January 2025.

    Reach out to contact@crawlfeeds.com

    Use Cases:

    • eCommerce Platforms: Integrate detailed product information to enhance your product listings.
    • Market Analysis: Analyze product trends, pricing, and competition in the industrial supply market.
    • Inventory Management: Utilize SKUs and unique identifiers for efficient inventory tracking.
    • Data-Driven Projects: Incorporate rich product data into your data science and machine learning models.

    Explore the vast collection of Grainger products and elevate your business insights with this high-quality dataset.

  14. Data from: Match Score Dataset for Team Ball Sports

    • data.mendeley.com
    Updated May 15, 2024
    Cite
    Thaksheel Alleck (2024). Match Score Dataset for Team Ball Sports [Dataset]. http://doi.org/10.17632/2pt4vmyf27.2
    Explore at:
    Dataset updated
    May 15, 2024
    Authors
    Thaksheel Alleck
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains match score data from major international competitions across 12 team ball sports: basketball, cricket, field hockey, futsal, handball, ice hockey, lacrosse, roller hockey, rugby, soccer, volleyball, and water polo. The dataset was obtained by web scraping data available on Wikipedia pages and includes, for each sport, the following information related to individual matches: the year of the competition edition when a match occurred, the names of the two opposing teams, their respective scores, and the name of the winning team.

  15. Telco_Customer_churn_Data

    • test.researchdata.tuwien.at
    bin, csv, png
    Updated Apr 28, 2025
    Cite
    Erum Naz (2025). Telco_Customer_churn_Data [Dataset]. http://doi.org/10.82556/b0ch-cn44
    Explore at:
    Available download formats: png, csv, bin
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Erum Naz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Context and Methodology

    The dataset originates from the research domain of Customer Churn Prediction in the Telecom Industry. It was created as part of the project "Data-Driven Churn Prediction: ML Solutions for the Telecom Industry," completed within the Data Stewardship course (Master programme Data Science, TU Wien).

    The primary purpose of this dataset is to support machine learning model development for predicting customer churn based on customer demographics, service usage, and account information.
    The dataset enables the training, testing, and evaluation of classification algorithms, allowing researchers and practitioners to explore techniques for customer retention optimization.

    The dataset was originally obtained from the IBM Accelerator Catalog and adapted for academic use. It was uploaded to TU Wien’s DBRepo test system and accessed via SQLAlchemy connections to the MariaDB environment.

    Technical Details

    The dataset has a tabular structure and was initially stored in CSV format. It contains:

    • Rows: 7,043 customer records

    • Columns: 21 features including customer attributes (gender, senior citizen status, partner status), account information (tenure, contract type, payment method), service usage (internet service, streaming TV, tech support), and the target variable (Churn: Yes/No).

    Naming Convention:

    • The table in the database is named telco_customer_churn_data.

    Software Requirements:

    • To open and work with the dataset, any standard database client or programming language supporting MariaDB connections can be used (e.g., Python with a MariaDB driver).

    • For machine learning applications, libraries such as pandas, scikit-learn, and joblib are typically used.
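    As a sketch of the workflow these libraries support (column names follow the schema described above, but the miniature frame below is synthetic and purely illustrative, standing in for the real 7,043-row table):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_churn_model(df: pd.DataFrame) -> RandomForestClassifier:
    """One-hot encode the categorical features and fit a baseline classifier."""
    X = pd.get_dummies(df.drop(columns=["Churn"]))
    y = (df["Churn"] == "Yes").astype(int)  # binary target: churn vs. no churn
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_train, y_train)
    return model

# Tiny synthetic stand-in echoing a few of the 21 documented columns.
demo = pd.DataFrame({
    "gender": ["Female", "Male"] * 10,
    "SeniorCitizen": [0, 1] * 10,
    "tenure": list(range(20)),
    "Contract": ["Month-to-month", "Two year"] * 10,
    "Churn": ["Yes", "No"] * 10,
})
model = train_churn_model(demo)
```

    In practice the frame would be read from the telco_customer_churn_data table via a MariaDB connection rather than constructed in memory.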

    Additional Resources:

    Further Details

    When reusing the dataset, users should be aware:

    • Licensing: The dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    • Use Case Suitability: The dataset is best suited for classification tasks, particularly binary classification (churn vs. no churn).

    • Metadata Standards: Metadata describing the dataset adheres to FAIR principles and is supplemented by CodeMeta and Croissant standards for improved interoperability.

  16. Rwanda Field Boundary Competition Dataset

    • cmr.earthdata.nasa.gov
    Updated Oct 10, 2023
    Cite
    (2023). Rwanda Field Boundary Competition Dataset [Dataset]. http://doi.org/10.34911/rdnt.g580ww
    Explore at:
    Dataset updated
    Oct 10, 2023
    Time period covered
    Jan 1, 2020 - Jan 1, 2023
    Area covered
    Description

    This dataset contains field boundaries for smallholder farms in eastern Rwanda. The NASA Harvest program funded a team of annotators from TaQadam to label Planet imagery for the 2021 growing season for the purpose of conducting the Rwanda Field Boundary Detection Challenge. The dataset includes rasterized labeled field boundaries and time series satellite imagery from Planet's NICFI program. Planet's basemap imagery is provided for six months (March, April, August, October, November, and December). The paired dataset is provided in 256x256 chips for a total of 70 tiles covering 1532 individual fields.

    Input imagery consists of a time series of Planet Basemaps (monthly composites) from the NICFI program.

    Imagery Copyright 2021 Planet Labs Inc. All use subject to the Participant License Agreement.

  17. TreeAI Global Initiative - Advancing tree species identification from aerial...

    • zenodo.org
    Updated Mar 8, 2025
    Cite
    Mirela Beloiu Schwenke; Zhongyu Xia; Arthur Gessler; Teja Kattenborn; Clemens Mosig; Stefano Puliti; Lars Waser; Nataliia Rehush; Yan Cheng; Liang Xinliang; Verena C. Griess; Martin Mokroš (2025). TreeAI Global Initiative - Advancing tree species identification from aerial images with deep learning [Dataset]. http://doi.org/10.5281/zenodo.14888706
    Explore at:
    Dataset updated
    Mar 8, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirela Beloiu Schwenke; Zhongyu Xia; Arthur Gessler; Teja Kattenborn; Clemens Mosig; Stefano Puliti; Lars Waser; Nataliia Rehush; Yan Cheng; Liang Xinliang; Verena C. Griess; Martin Mokroš
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    TreeAI - Advancing Tree Species Identification from Aerial Images with Deep Learning

    Data Structure for the TreeAI Database Used in the TreeAI4Species Competition

    The data are in the COCO format; each folder contains training and validation subfolders with images and labels carrying the tree species ID.
    Training: images (.png) and labels (.txt)
    Validation: images (.png) and labels (.txt)
    Images: RGB bands, 8-bit, chip size 640 x 640 pixels = 32 x 32 m at 5 cm pixel spatial resolution.
    Labels: labels are prepared for object detection tasks; the number of classes varies per dataset (e.g. dataset 12_RGB_all_L has 53 classes), and the Latin name of the species for each class ID is given in the file named classDatasetName.xlsx.
    Species class: classDatasetName.xlsx contains 3 columns: Species_ID, Labels (number of labels), and Species_Class (Latin name of the species).
    Masked images: the dataset with partial labels was masked, i.e. a buffer of 30 pixels was created around each label and the image was masked based on these buffers, e.g. 34_RGB_all_L_PascalVoc_640Mask.
    Additional filters to clean up the data:
    Labels at the edge: images with labels at the edge were removed.
    Valid labels: images whose labels lay completely within the image were retained.
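    The per-file label layout is not spelled out above; assuming YOLO-style .txt lines (a class ID followed by four normalized box coordinates), a minimal parser might look like this. The line format, field names, and demo values are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class TreeBox:
    """One detected tree: species class ID plus a normalized bounding box."""
    species_id: int
    x: float
    y: float
    w: float
    h: float

def parse_label_lines(lines):
    """Parse 'class_id x y w h' lines into TreeBox records, skipping malformed rows."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # ignore blank or malformed lines
        cid, x, y, w, h = parts
        boxes.append(TreeBox(int(cid), float(x), float(y), float(w), float(h)))
    return boxes

# Illustrative label lines, not taken from the real dataset.
demo = ["12 0.50 0.40 0.10 0.20", "3 0.25 0.75 0.05 0.05"]
boxes = parse_label_lines(demo)
```

    The species ID in each box would then be mapped to its Latin name via the classDatasetName.xlsx lookup table described above.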
    Table 1. Description of the datasets included in the TreeAI database.

    a) Fully labeled images (i.e. the image has all the trees delineated and each polygon has species information)

    b) Partially labeled images (i.e. the image has only some trees delineated, and each polygon has species information)

    No. | Dataset name                   | Training images | Validation images | Fully labeled | Partially labeled
    1   | 12_RGB5cm_FullyLabeled         | 1066            | 304               | x             |
    2   | ObjectDetection_TreeSpecies    | 422             | 84                | x             |
    3   | 34_RGB_all_L_PascalVoc_640Mask | 951             | 272               |               | x
    4   | 34_RGB_PartiallyLabeled640     | 917             | 262               |               | x
    Steps to access the dataset and participate in the TreeAI4Species competition:
    • Register: Access to the data will be granted upon registering for the competition, see the registration form: https://form.ethz.ch/research/tree-ai-global-database/treeai-competition.html
    • Request the dataset: After registering, request the competition record to download it. In your request, enter your full name, your purpose (e.g. acceptance of the TreeAI4Species data license), your affiliation, and the country of affiliation. This allows us to check whether you are already registered.
    • Test dataset: Only the participants registered for the competition will receive the test dataset.
    • Submit your DL models for evaluation by June 2025.
    • Award: The best models win a prize.
    • Publication: All participants in the competition who submit the required files for evaluation will be included in the subsequent publication.

    License

    == CC BY-NC-ND (Attribution-NonCommercial-NoDerivatives) ==
    Dear user,
    DATA ANALYSIS AND PUBLICATION
    The TreeAI database is released under a variant of the CC BY-NC-ND license. This database is confidential and can be used only for the TreeAI4Species data science competition. It is not permitted to pass on the data or the characteristics directly derived from it to third parties. Written consent from the data supplier is required for use for any other purpose.
    LIABILITY
    The data are based on the current state of existing scientific knowledge; however, no liability is assumed for completeness. This is the first version of the database, and we plan to improve the tree annotations and include new tree species, so another version will be released in the future.
    The data can only be used for the purpose described by the user when requesting the data.
    ------------------------------------------------------
    ETH Zürich
    Dr. Mirela Beloiu Schwenke
    Institute of Terrestrial Ecosystems
    Department of Environmental Systems Science, CHN K75
    Universitätstrasse 16, 8092 Zürich, Schweiz
    mirela.beloiu@usys.ethz.ch

  18. Data from: PlanktonSet 1.0: Plankton imagery data collected from F.G. Walton...

    • catalog.data.gov
    • data.wu.ac.at
    Updated Aug 1, 2025
    Cite
    (Point of Contact) (2025). PlanktonSet 1.0: Plankton imagery data collected from F.G. Walton Smith in Straits of Florida from 2014-06-03 to 2014-06-06 and used in the 2015 National Data Science Bowl (NCEI Accession 0127422) [Dataset]. https://catalog.data.gov/dataset/planktonset-1-0-plankton-imagery-data-collected-from-f-g-walton-smith-in-straits-of-florida-fro
    Explore at:
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    (Point of Contact)
    Area covered
    Straits of Florida
    Description

    Data presented here are a subset of a larger plankton imagery data set collected in the subtropical Straits of Florida from 2014-05-28 to 2014-06-14. Imagery data were collected using the In Situ Ichthyoplankton Imaging System (ISIIS-2) as part of an NSF-funded project to assess the biophysical drivers affecting fine-scale interactions between larval fish, their prey, and predators. This subset of images was used in the inaugural National Data Science Bowl (www.datasciencebowl.com), hosted by Kaggle and sponsored by Booz Allen Hamilton. Data were originally collected to examine the biophysical drivers affecting fine-scale (spatial) interactions between larval fish, their prey, and predators in a subtropical pelagic marine ecosystem. Image segments extracted from the raw data were sorted into 121 plankton classes, split 50:50 into train and test data sets, and provided for a machine learning competition (the National Data Science Bowl). No hierarchical relationships were explicit in the 121 plankton classes, though the class naming convention and a tree-like diagram (see file "Plankton Relationships.pdf") indicated relationships between classes, whether taxonomic or structural (size and shape). We intend for this dataset to be available to the machine learning and computer vision community as a standard machine learning benchmark. This "Plankton 1.0" dataset is a medium-size dataset with a fair amount of complexity where image classification improvements can still be made.
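    A 50:50 train/test split like the one described above can be sketched as follows; the shuffling scheme and file names are assumptions for illustration, not the procedure actually used by the competition organizers:

```python
import random

def split_half(files, seed=0):
    """Shuffle a list of image files and split it 50:50 into train and test."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(files)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Placeholder file names standing in for one plankton class's image segments.
train, test = split_half([f"img_{i:03d}.png" for i in range(100)])
```

    Applying such a split per class (rather than over the pooled images) keeps the 121 class proportions comparable between the two halves.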

  19. Public Dataset for 2020 Kaggle Survey Data

    • kaggle.com
    Updated Jan 26, 2021
    Cite
    Paul Mooney (2021). Public Dataset for 2020 Kaggle Survey Data [Dataset]. https://www.kaggle.com/datasets/paultimothymooney/public-dataset-for-2020-kaggle-survey-data/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 26, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Paul Mooney
    Description

    This is just a copy of https://www.kaggle.com/c/kaggle-survey-2020/data.

    It is hosted as a public Kaggle dataset instead of a Kaggle Competition Dataset to make it easier for users to find via search (public datasets and competition datasets have separate search menus, which is confusing, so hosting a copy here makes the data easier to find).

    You are probably looking for https://www.kaggle.com/c/kaggle-survey-2020/data

  20. S

    A Benchmark Dataset for Fine-Grained Object Detection and Recognition Based...

    • scidb.cn
    Updated Feb 20, 2025
    Cite
    Youming Wu; Wenhui Diao; Yuxi Suo; Xian Sun (2025). A Benchmark Dataset for Fine-Grained Object Detection and Recognition Based on Single-Look Complex SAR Images (FAIR-CSAR-V1.0) [Dataset]. http://doi.org/10.57760/sciencedb.radars.00019
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Youming Wu; Wenhui Diao; Yuxi Suo; Xian Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The FAIR-CSAR-V1.0 dataset, constructed from single-look complex (SLC) images of the Gaofen-3 satellite, is the largest and most finely annotated SAR image dataset for fine-grained targets to date. FAIR-CSAR-V1.0 aims to advance related technologies in SAR image object detection, recognition, and target characteristic understanding. The dataset was developed by the Key Laboratory of Target Cognition and Application Technology (TCAT) at the Aerospace Information Research Institute, Chinese Academy of Sciences.

    FAIR-CSAR-V1.0 comprises 175 scenes of Gaofen-3 Level-1 SLC products, covering 32 global regions including airports, oil refineries, ports, and rivers. With a total data volume of 250 GB and over 340,000 instances, FAIR-CSAR-V1.0 covers 5 main categories and 22 subcategories, providing detailed annotations for imaging parameters (e.g., radar center frequency, pulse repetition frequency) and target characteristics (e.g., satellite-ground relative azimuth angle, key scattering point distribution).

    FAIR-CSAR-V1.0 consists of two sub-datasets: the SL dataset and the FSI dataset. The SL dataset, acquired in spotlight mode with a nominal resolution of 1 meter, contains 170,000 instances across 22 target classes. The FSI dataset, acquired in fine stripmap mode with a nominal resolution of 5 meters, includes 170,000 instances across 3 target classes. Figure 1 presents an overview of the dataset.

    Data paper and citation format:

    [1] Youming Wu, Wenhui Diao, Yuxi Suo, Xian Sun. A Benchmark Dataset for Fine-Grained Object Detection and Recognition Based on Single-Look Complex SAR Images (FAIR-CSAR-V1.0) [OL]. Journal of Radars, 2025. https://radars.ac.cn/web/data/getData?dataType=FAIR_CSAR_en&pageType=en.

    [2] Y. Wu, Y. Suo, Q. Meng, W. Dai, T. Miao, W. Zhao, Z. Yan, W. Diao, G. Xie, Q. Ke, Y. Zhao, K. Fu and X. Sun, FAIR-CSAR: A Benchmark Dataset for Fine-Grained Object Detection and Recognition Based on Single-Look Complex SAR Images[J]. IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-22, 2025, doi: 10.1109/TGRS.2024.3519891.
