24 datasets found
  1. 🎓 365DS Practice Exams • People Analytics Dataset

    • kaggle.com
    zip
    Updated May 20, 2025
    Cite
    Ísis Santos Costa (2025). 🎓 365DS Practice Exams • People Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/isissantoscosta/365ds-practice-exams-people-analytics-dataset
    Explore at:
    zip (61775349 bytes)
    Dataset updated
    May 20, 2025
    Authors
    Ísis Santos Costa
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset was uploaded to Kaggle in the course of solving the questions of the 365 Data Science • Practice Exams: SQL curriculum, a set of free resources designed to help test and elevate data science skills. It is a synthetic, relational collection of data structured to simulate common employee and organizational data scenarios, ideal for practicing SQL queries and data analysis skills in a People Analytics context.

    The dataset contains the following tables:

    • departments.csv: List of all company departments.
    • dept_emp.csv: Historical and current assignments of employees to departments.
    • dept_manager.csv: Historical and current assignments of employees as department managers.
    • employees.csv: Core employee demographic information.
    • employees.db: A SQLite database containing all the relational tables from the CSV files.
    • salaries.csv: Historical salary records for employees.
    • titles.csv: Historical job titles held by employees.
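    Since the tables also ship as a single SQLite file (employees.db), they can be queried directly from Python. A minimal sketch, assuming the classic employees-dataset schema (a salaries table with emp_no and salary columns; verify against the actual file):

```python
import sqlite3

def top_salaries(conn, n=5):
    """Return the n highest salaries per employee, highest first.

    Column names (emp_no, salary) are assumed from the classic
    employees dataset; verify against the actual employees.db schema.
    """
    cur = conn.execute(
        "SELECT emp_no, MAX(salary) AS max_salary "
        "FROM salaries GROUP BY emp_no "
        "ORDER BY max_salary DESC LIMIT ?",
        (n,),
    )
    return cur.fetchall()

# Against the downloaded file:
# conn = sqlite3.connect("employees.db")
# print(top_salaries(conn))
```
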

    Usage

    The dataset is ideal for practicing SQL queries and data analysis skills in a People Analytics context, and supports applications in both general Data Analytics and Time Series Analysis.

    A practical application is presented in the 🎓 365DS Practice Exams • SQL notebook, which covers in detail the answers to the questions of SQL Practice Exams 1, 2, and 3 on the 365DS platform, especially illustrating the usage and value of SQL procedures and functions.

    Acknowledgements & Data Origin

    This dataset has a rich lineage, originating from academic research and evolving through various formats to its current relational structure:

    Original Authors

    The foundational dataset was authored by Prof. Dr. Fusheng Wang 🔗 (then a PhD student at the University of California, Los Angeles - UCLA) and his advisor, Prof. Dr. Carlo Zaniolo 🔗 (UCLA). This work is primarily described in their paper:

    Relational Conversion

    It was originally distributed as an .xml file. Giuseppe Maxia (known as @datacharmer on GitHub🔗 and LinkedIn🔗, as well as here on Kaggle) converted it into its relational form and subsequently distributed it as a .sql file, making it accessible for relational database use.

    Kaggle Upload

    This .sql version was then loaded to Kaggle as the « Employees Dataset » by Mirza Huzaifa🔗 on February 5th, 2023.

  2. Cafe Sales - Dirty Data for Cleaning Training

    • kaggle.com
    zip
    Updated Jan 17, 2025
    Cite
    Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training
    Explore at:
    zip (113510 bytes)
    Dataset updated
    Jan 17, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Cafe Sales Dataset

    Overview

    The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

    File Information

    • File Name: dirty_cafe_sales.csv
    • Number of Rows: 10,000
    • Number of Columns: 8

    Columns Description

    • Transaction ID: A unique identifier for each transaction. Always present and unique. Example: TXN_1234567
    • Item: The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). Examples: Coffee, Sandwich
    • Quantity: The quantity of the item purchased. May contain missing or invalid values. Examples: 1, 3, UNKNOWN
    • Price Per Unit: The price of a single unit of the item. May contain missing or invalid values. Examples: 2.00, 4.00
    • Total Spent: The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. Examples: 8.00, 12.00
    • Payment Method: The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). Examples: Cash, Credit Card
    • Location: The location where the transaction occurred. May contain missing or invalid values. Examples: In-store, Takeaway
    • Transaction Date: The date of the transaction. May contain missing or incorrect values. Example: 2023-01-01

    Data Characteristics

    1. Missing Values:

      • Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
    2. Invalid Values:

      • Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
    3. Price Consistency:

      • Prices for menu items are consistent but may have missing or incorrect values introduced.

    Menu Items

    The dataset includes the following menu items with their respective price ranges:

    • Coffee: $2.00
    • Tea: $1.50
    • Sandwich: $4.00
    • Salad: $5.00
    • Cake: $3.00
    • Cookie: $1.00
    • Smoothie: $4.00
    • Juice: $3.00

    Use Cases

    This dataset is suitable for:

    • Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
    • Exploring EDA techniques like visualizations and summary statistics.
    • Performing feature engineering for machine learning workflows.

    Cleaning Steps Suggestions

    To clean this dataset, consider the following steps:

    1. Handle Missing Values:

      • Fill missing numeric values with the median or mean.
      • Replace missing categorical values with the mode or "Unknown."
    2. Handle Invalid Values:

      • Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
    3. Date Consistency:

      • Ensure all dates are in a consistent format.
      • Fill missing dates with plausible values based on nearby records.
    4. Feature Engineering:

      • Create new columns, such as Day of the Week or Transaction Month, for further analysis.
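    The steps above can be sketched in pandas. A minimal sketch, assuming the column names and sentinel values ("ERROR", "UNKNOWN") documented above:

```python
import numpy as np
import pandas as pd

def clean_cafe_sales(df):
    """Minimal cleaning sketch for dirty_cafe_sales.csv.

    Column names follow the dataset description; adapt if the
    actual file differs.
    """
    # Treat the documented sentinel strings as missing values
    df = df.replace({"ERROR": np.nan, "UNKNOWN": np.nan})
    # Coerce numeric columns; anything unparseable becomes NaN
    for col in ["Quantity", "Price Per Unit", "Total Spent"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Recompute Total Spent as Quantity * Price Per Unit where missing
    df["Total Spent"] = df["Total Spent"].fillna(
        df["Quantity"] * df["Price Per Unit"]
    )
    # Replace missing categorical values with a placeholder
    for col in ["Item", "Payment Method", "Location"]:
        df[col] = df[col].fillna("Unknown")
    # Parse dates; invalid entries become NaT
    df["Transaction Date"] = pd.to_datetime(
        df["Transaction Date"], errors="coerce"
    )
    return df
```
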

    License

    This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

    Feedback

    If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

  3. Healthcare Device Data Analysis with R

    • kaggle.com
    zip
    Updated Oct 7, 2021
    Cite
    stanley888cy (2021). Healthcare Device Data Analysis with R [Dataset]. https://www.kaggle.com/stanley888cy/google-project-02
    Explore at:
    zip (353177 bytes)
    Dataset updated
    Oct 7, 2021
    Authors
    stanley888cy
    Description

    Context

    Hi. This is my data analysis project, and also my first try at using R in my work. It is the capstone project for the Google Data Analytics Certificate course offered on Coursera (https://www.coursera.org/professional-certificates/google-data-analytics). It is an analysis of operational data from a health monitoring device. For the detailed background story, please check the pdf file (Case 02.pdf) for reference.

    In this case study, I use personal health tracker data from Fitbit to evaluate how the health tracker device is used, and then determine whether there are any trends or patterns.

    My data analysis will focus on 2 areas: exercise activity and sleeping habits. The exercise activity part studies the relationship between activity type and calories consumed, while the sleeping habits part identifies patterns in user sleep. In this analysis, I will also try to use some linear regression models, so that the data can be explained in a quantitative way and predictions become easier.

    I understand that I am new to data analysis and that my skills and code are very beginner level, but I am working hard to learn more in both R and the data science field. If you have any ideas or feedback, please feel free to comment.

    Stanley Cheng 2021-10-07

  4. H

    Political Analysis Using R: Example Code and Data, Plus Data for Practice...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Apr 28, 2020
    Cite
    Jamie Monogan (2020). Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems [Dataset]. http://doi.org/10.7910/DVN/ARKOTI
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Jamie Monogan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.

  5. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    zip (23875170 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all transactions that occurred over a period of time. The retailer will use the results to grow its business: by suggesting itemsets to customers, it can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association Rules are most often used when you want to find associations between different objects in a set, i.e., frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both.

    • Rule: bought computer mouse => bought mouse mat
    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(mouse) = 0.08/0.10 = 0.8
    • lift = confidence / P(mat) = 0.8/0.09 ≈ 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
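    The three metrics from the worked example can be computed directly from raw counts. A minimal sketch in plain Python, with no assumptions beyond the definitions above:

```python
def association_metrics(n_total, n_a, n_b, n_both):
    """Support, confidence and lift for the rule A => B, from raw counts."""
    support = n_both / n_total           # P(A and B)
    confidence = n_both / n_a            # P(B | A)
    lift = confidence / (n_b / n_total)  # P(B | A) / P(B)
    return support, confidence, lift

# The worked example: 100 customers, 10 bought a mouse,
# 9 bought a mat, 8 bought both.
support, confidence, lift = association_metrics(100, 10, 9, 8)
print(support, confidence, round(lift, 1))  # 0.08 0.8 8.9
```
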

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

    Libraries in R

    First, we need to load required libraries. Shortly I describe all libraries.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

    Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

    Data Pre-processing

    Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.

    Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
    Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

    Next, we clean our data frame by removing missing values.

    Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

    To apply Association Rule mining, we need to convert the dataframe into transaction data, so that all items bought together in one invoice will be in ...

  6. Breast Cancer Dataset

    • kaggle.com
    zip
    Updated Jun 17, 2022
    Cite
    Ms. Nancy Al Aswad (2022). Breast Cancer Dataset [Dataset]. https://www.kaggle.com/datasets/nancyalaswad90/breast-cancer-dataset/code
    Explore at:
    zip (49781 bytes)
    Dataset updated
    Jun 17, 2022
    Authors
    Ms. Nancy Al Aswad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    What is Breast Cancer Dataset?

    Breast cancer is the most common cancer amongst women in the world. It accounts for 25% of all cancer cases and affected over 2.1 Million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area.

    Image: https://user-images.githubusercontent.com/36210723/182301443-382b14e1-71c1-46ac-88f5-e72a9b2083e7.jpg

    How to use this dataset

    The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous). We ask you to complete the analysis of classifying these tumors using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) Dataset.
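    A minimal sketch of the suggested SVM classification, using the copy of the Breast Cancer Wisconsin (Diagnostic) dataset bundled with scikit-learn (which may differ slightly from the file uploaded here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the Breast Cancer Wisconsin (Diagnostic) data bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scale features before fitting an RBF-kernel SVM
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

    Scaling matters here: the features span very different ranges, and an RBF kernel is sensitive to that.
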

    Acknowledgments

    When we use this dataset in our research, we credit the authors as:

    The main idea behind uploading this dataset is to practice data analysis with my students. As I work at a college, I want my students to apply what we study to a big dataset. It may not be up to date, and I mention the collection years, but it is a good resource for practicing with data.

  7. E-commerce_dataset

    • kaggle.com
    zip
    Updated Nov 16, 2025
    Cite
    Abhay Ayare (2025). E-commerce_dataset [Dataset]. https://www.kaggle.com/datasets/abhayayare/e-commerce-dataset
    Explore at:
    zip (644123 bytes)
    Dataset updated
    Nov 16, 2025
    Authors
    Abhay Ayare
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    E-commerce_dataset

    This dataset is a synthetic yet realistic E-commerce retail dataset generated programmatically using Python (Faker + NumPy + Pandas).
    It is designed to closely mimic real-world online shopping behavior, user patterns, product interactions, seasonal trends, and marketplace events.
    
    

    You can use this dataset for:

    Machine Learning & Deep Learning
    Recommender Systems
    Customer Segmentation
    Sales Forecasting
    A/B Testing
    E-commerce Behaviour Analysis
    Data Cleaning / Feature Engineering Practice
    SQL practice
    

    📁 **Dataset Contents**

    The dataset contains 6 CSV files:

    • users.csv (~10,000 rows): User profiles, demographics & signup info
    • products.csv (~2,000 rows): Product catalog with rating and pricing
    • orders.csv (~20,000 rows): Order-level transactions
    • order_items.csv (~60,000 rows): Items purchased per order
    • reviews.csv (~15,000 rows): Customer-written product reviews
    • events.csv (~80,000 rows): User event logs: view, cart, wishlist, purchase

    🧬 Data Dictionary

    1. Users (users.csv)
    Column Description
    user_id Unique user identifier
    name  Full customer name
    email  Email (synthetic, no real emails)
    gender Male / Female / Other
    city  City of residence
    signup_date Account creation date
    
    2. Products (products.csv)
    Column Description
    product_id Unique product identifier
    product_name  Product title
    category  Electronics, Clothing, Beauty, Home, Sports, etc.
    price  Actual selling price
    rating Average product rating
    
    3. Orders (orders.csv)
    Column Description
    order_id  Unique order identifier
    user_id User who placed the order
    order_date Timestamp of the order
    order_status  Completed / Cancelled / Returned
    total_amount  Total order value
    
    4. Order Items (order_items.csv)
    Column Description
    order_item_id  Unique identifier
    order_id  Associated order
    product_id Purchased product
    quantity  Quantity purchased
    item_price Price per unit
    
    5. Reviews (reviews.csv)
    Column Description
    review_id  Unique review identifier
    user_id User who submitted review
    product_id Reviewed product
    rating 1–5 star rating
    review_text Short synthetic review
    review_date Submission date
    
    6. Events (events.csv)
    Column Description
    event_id  Unique event identifier
    user_id User performing event
    product_id Viewed/added/purchased product
    event_type view/cart/wishlist/purchase
    event_timestamp Timestamp of event
    

    🧠 Possible Use Cases (Ideas & Projects)

    🔍 Machine Learning

    Customer churn prediction
    Review sentiment analysis (NLP)
    Recommendation engines
    Price optimization models
    Demand forecasting (Time-series)
    

    📦 Business Analytics

    Market basket analysis
    RFM segmentation
    Cohort analysis
    Funnel conversion tracking
    A/B testing simulations
    

    🧮 SQL Practice

    Joins
    Window functions
    Aggregations
    CTE-based funnels
    Complex queries
    
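    As one sketch of the CTE-based funnels listed above, run against a tiny stand-in for events.csv loaded into SQLite (only the user_id and event_type columns from the data dictionary are assumed):

```python
import sqlite3

# Tiny stand-in for events.csv (user_id, event_type), per the data dictionary
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_type TEXT);
    INSERT INTO events VALUES
        (1,'view'),(1,'cart'),(1,'purchase'),
        (2,'view'),(2,'cart'),
        (3,'view');
""")

# CTE-based funnel: distinct users reaching each stage
funnel = conn.execute("""
    WITH stage_counts AS (
        SELECT event_type, COUNT(DISTINCT user_id) AS users
        FROM events
        WHERE event_type IN ('view','cart','purchase')
        GROUP BY event_type
    )
    SELECT event_type, users
    FROM stage_counts
    ORDER BY users DESC
""").fetchall()
print(funnel)  # [('view', 3), ('cart', 2), ('purchase', 1)]
```
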

    🛠 How the Dataset Was Generated

    The dataset was generated entirely in Python using:

    Faker for realistic user and review generation
    NumPy for probability-based event modeling
    Pandas for data processing
    

    Custom logic for:

    demand variation
    user behavior simulation
    return/cancel probabilities
    seasonal order timestamp distribution
    The dataset does not include any real personal data.
    Everything is generated synthetically.
    

    ⚠️ License

    This dataset is released under CC BY 4.0 — free to use for:
    Research
    Education
    Commercial projects
    Kaggle competitions
    Machine learning pipelines
    Just provide attribution.
    

    ⭐ If you found this dataset helpful, please:

    Upvote the dataset
    Leave a comment
    Share your notebooks using it
    
  8. Premier League Matches Dataset - 2021 to 2025

    • kaggle.com
    Updated Jul 26, 2025
    Cite
    armin2080 (2025). Premier League Matches Dataset - 2021 to 2025 [Dataset]. https://www.kaggle.com/datasets/armin2080/premier-league-matches-dataset-2021-to-2025
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 26, 2025
    Dataset provided by
    Kaggle
    Authors
    armin2080
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains detailed information on all Premier League matches played between the 2021 and 2025 seasons. It includes match dates, times, venues, results, goals scored (gf), goals against (ga), expected goals (xg), possession percentages, attendance figures, team formations, referees, and other relevant statistics. This data can be used for analysis, modeling predictions, or exploring trends in Premier League football.

    Columns:

    • date: The date of the match (format: MM/DD/YYYY)
    • time: The time of the match (in 24-hour format)
    • comp: Competition name (e.g., Premier League)
    • round: Match round or week number
    • day: Day of the week when the match was played
    • venue: Venue where the match took place
    • result: Result of the match (W for Win, D for Draw, L for Loss)
    • gf: Goals For, the number of goals scored by the home team
    • ga: Goals Against, the number of goals conceded by the home team
    • opponent: Name of the opposing team
    • xg: Expected Goals for the home team
    • xga: Expected Goals Against for the home team
    • poss: Possession percentage
    • attendance: Number of spectators attending the match
    • captain: Captain's name for the home team
    • formation: Formation used by the home team
    • opp formation: Formation used by the opponent
    • referee: Referee officiating the match
    • match report: Link or reference to a detailed match report
    • notes: Additional notes regarding specific matches
    • sh: Total shots taken by the home team
    • sot: Shots on target by the home team
    • dist: Average distance of shots taken (in meters)
    • fk: Number of free kicks awarded to the home team
    • pk: Number of penalties awarded to the home team
    • pkatt: Number of penalties attempted by the home team
    • team: Name of the home team
    • season: Season during which matches were played
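    Using the team and result columns documented above, a simple league-points tally can be sketched in pandas (the rows below are stand-ins, not real match data):

```python
import pandas as pd

# A few rows in the shape of the dataset's `team` and `result` columns
matches = pd.DataFrame({
    "team":   ["Arsenal", "Arsenal", "Chelsea", "Chelsea"],
    "result": ["W", "D", "L", "W"],
})

# Standard scoring: W=3, D=1, L=0
matches["points"] = matches["result"].map({"W": 3, "D": 1, "L": 0})
table = (matches.groupby("team")["points"]
                .sum()
                .sort_values(ascending=False))
print(table)
```
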
  9. Customer Shopping Trends Dataset

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Cite
    Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
    Explore at:
    zip (149846 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    Sourav Banerjee
    Description

    Context

    The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.

    Content

    This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

    Dataset Glossary (Column-wise)

    • Customer ID - Unique identifier for each customer
    • Age - Age of the customer
    • Gender - Gender of the customer (Male/Female)
    • Item Purchased - The item purchased by the customer
    • Category - Category of the item purchased
    • Purchase Amount (USD) - The amount of the purchase in USD
    • Location - Location where the purchase was made
    • Size - Size of the purchased item
    • Color - Color of the purchased item
    • Season - Season during which the purchase was made
    • Review Rating - Rating given by the customer for the purchased item
    • Subscription Status - Indicates if the customer has a subscription (Yes/No)
    • Shipping Type - Type of shipping chosen by the customer
    • Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)
    • Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)
    • Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction
    • Payment Method - Customer's most preferred payment method
    • Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)

    Structure of the Dataset

    Image: https://i.imgur.com/6UEqejq.png

    Acknowledgement

    This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

    Cover Photo by: Freepik

    Thumbnail by: Clothing icons created by Flat Icons - Flaticon

  10. SQL Case Study for Data Analysts

    • kaggle.com
    zip
    Updated Jan 29, 2025
    Cite
    ShravyaShetty1 (2025). SQL Case Study for Data Analysts [Dataset]. https://www.kaggle.com/datasets/shravyashetty1/sql-basic-case-study
    Explore at:
    zip (59519 bytes)
    Dataset updated
    Jan 29, 2025
    Authors
    ShravyaShetty1
    Description

    This dataset is a practical SQL case study designed for learners who are looking to enhance their SQL skills in analyzing sales, products, and marketing data. It contains several SQL queries related to a simulated business database for product sales, marketing expenses, and location data. The database consists of three main tables: Fact, Product, and Location.

    Objective of the Case Study: The purpose of this case study is to provide learners with a variety of practical SQL exercises that involve real-world business problems. The queries explore topics such as:

    • Aggregating data (e.g., sum, count, average)
    • Filtering and sorting data
    • Grouping and joining multiple tables
    • Using SQL functions like AVG(), COUNT(), SUM(), MIN(), and MAX()
    • Handling advanced SQL features such as row numbering, transactions, and stored procedures
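    A minimal sketch of the kind of join-and-aggregate query the case study covers, run against tiny stand-in versions of the Fact and Product tables (the column names here are hypothetical; adapt them to the actual schema):

```python
import sqlite3

# Tiny stand-in for the case study's Fact and Product tables
# (column names here are hypothetical; adapt to the actual schema)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (product_id INTEGER, product_name TEXT);
    CREATE TABLE Fact (product_id INTEGER, sales REAL);
    INSERT INTO Product VALUES (1,'Widget'),(2,'Gadget');
    INSERT INTO Fact VALUES (1,10.0),(1,15.0),(2,7.5);
""")

# Join, aggregate, and sort: total and average sales per product
rows = conn.execute("""
    SELECT p.product_name,
           SUM(f.sales) AS total_sales,
           AVG(f.sales) AS avg_sales
    FROM Fact f
    JOIN Product p ON p.product_id = f.product_id
    GROUP BY p.product_name
    ORDER BY total_sales DESC
""").fetchall()
print(rows)  # [('Widget', 25.0, 12.5), ('Gadget', 7.5, 7.5)]
```
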
  11. NBA-National Board of Accreditation-dataset

    • kaggle.com
    zip
    Updated Jul 29, 2023
    Cite
    Shivam Ardeshna (2023). NBA-National Board of Accreditation-dataset [Dataset]. https://www.kaggle.com/datasets/shivamardeshna/nba-national-board-of-accreditation-dataset/code
    Explore at:
    zip (11342 bytes)
    Dataset updated
    Jul 29, 2023
    Authors
    Shivam Ardeshna
    Description

    The description is a more detailed explanation of the dataset's content, source, and potential use cases. It helps users understand the dataset's relevance and usefulness for their projects. Here's an example description for the NBA players performance dataset: Description Example: "This dataset contains comprehensive performance statistics for NBA players from the 2020-2021 season. It includes player-level data such as points scored, rebounds, assists, field goal percentage, free throw percentage, and more. The data was collected from official NBA records and other reputable sources.

    The dataset can be used for various data analysis and machine learning tasks related to NBA player performance. Analysts and researchers can explore player trends, compare individual performances, identify standout players, and investigate correlations between different performance metrics.

    Whether you're an NBA enthusiast, a data scientist, or a basketball coach, this dataset provides valuable insights into the statistical aspects of player performance in the 2020-2021 NBA season. It is ideal for data-driven research, building predictive models, and gaining a deeper understanding of player contributions to their teams."

  12. E-commerce Customer Behaviour Dataset

    • kaggle.com
    zip
    Updated Sep 27, 2025
    Cite
    Paul Samuel W E (2025). E-commerce Customer Behaviour Dataset [Dataset]. https://www.kaggle.com/datasets/paulsamuelwe/e-commerce-customer-behaviour-dataset
    Explore at:
    zip (10257 bytes)
    Dataset updated
    Sep 27, 2025
    Authors
    Paul Samuel W E
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    E-Commerce Customer Behavior Dataset

    The E-Commerce Customer Behavior Dataset is a synthetic dataset designed to capture the full spectrum of customer interactions with an online retail platform. Created by Gretel AI for educational and research purposes, it provides a comprehensive view of how customers browse, purchase, and review products. The dataset is ideal for data science practice, machine learning modeling, and exploratory analytics.

    Features and Variables

    Customer ID

    • Unique identifier for each customer.
    • Allows tracking customer behavior across multiple features.

    Age

    • Numeric value representing customer age.
    • Useful for demographic analysis and segmentation.

    Gender

    • Categorical: Male, Female, Other.
    • Enables study of gender-specific purchasing patterns.

    Location

    • Geographic location of the customer (city or region).
    • Supports regional analysis and location-based marketing insights.

    Annual Income

    • Customer’s annual income in USD.
    • Key for understanding purchasing power and spending habits.

    Purchase History

    • Structured list of products purchased, including:

      • Date of purchase
      • Product category
      • Price
    • Allows analysis of repeat purchases, product popularity, and category trends.

    Browsing History

    • Records of products viewed by the customer with timestamps.
    • Useful to study engagement patterns, interests, and conversion likelihood.

    Product Reviews

    • Textual reviews and ratings (1–5 stars) provided by customers.
    • Enables qualitative analysis of customer satisfaction and sentiment.

    Time on Site

    • Total duration (in minutes) spent by the customer per session.
    • Indicator of user engagement and browsing intensity.

    Data Summary

    | Feature | Range / Distribution | Notes |
    | --- | --- | --- |
    | Age | 24–65 | Mean: 40, Std: 11 |
    | Gender | Female 52%, Male 36%, Other 12% | Categorical |
    | Location | Most common: City D (24%), City E (12%), Other (64%) | Regional trends |
    | Annual Income | $40,000–$100,000 | Mean: $65,800, Std: $16,900 |
    | Time on Site | 32.5–486.3 mins | Mean: 233, Std: 109 |

    Example Entries

    Purchase History

    [
     {"Date": "2022-03-05", "Category": "Clothing", "Price": 34.99},
     {"Date": "2022-02-12", "Category": "Electronics", "Price": 129.99},
     {"Date": "2022-01-20", "Category": "Home & Garden", "Price": 29.99}
    ]
    

    Browsing History

    [
     {"Timestamp": "2022-03-10T14:30:00Z"},
     {"Timestamp": "2022-03-11T09:45:00Z"},
     {"Timestamp": "2022-03-12T16:20:00Z"}
    ]
    

    Product Review

    {
     "Review Text": "Excellent product, highly recommend!",
     "Rating": 5
    }
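The nested entries above can be flattened into one row per purchase for analysis. A minimal sketch, assuming each customer's purchase history is stored as a JSON-encoded string shaped like the example (the actual storage format in the file may differ):

```python
import json
from statistics import mean

# Hypothetical record shaped like the example entries above; the real
# column names and encoding in the dataset may differ.
customer = {
    "Customer ID": 101,
    "Purchase History": json.dumps([
        {"Date": "2022-03-05", "Category": "Clothing", "Price": 34.99},
        {"Date": "2022-02-12", "Category": "Electronics", "Price": 129.99},
        {"Date": "2022-01-20", "Category": "Home & Garden", "Price": 29.99},
    ]),
}

# Flatten the JSON-encoded purchase list into one row per purchase.
purchases = json.loads(customer["Purchase History"])
rows = [{"customer_id": customer["Customer ID"], **p} for p in purchases]

total_spend = sum(p["Price"] for p in purchases)
avg_price = mean(p["Price"] for p in purchases)
print(rows[0])
print(round(total_spend, 2))  # 194.97
```

A flat table like `rows` is what most segmentation or basket-analysis workflows expect as input.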
    

    Methodology

    This dataset was synthetically generated using machine learning techniques to simulate realistic customer behavior:

    1. Pattern Recognition Identifying trends and correlations observed in real-world e-commerce datasets.

    2. Synthetic Data Generation Producing data points for all features while preserving realistic relationships.

    3. Controlled Variation Introducing diversity to reflect a wide range of customer behaviors while maintaining logical consistency.

    Potential Use Cases

    • Customer segmentation and profiling
    • Predictive modeling of purchases and churn
    • Recommender system development
    • Sentiment analysis and natural language processing on reviews
    • Engagement and behavioral analytics

    License

    CC BY 4.0 (Attribution 4.0 International) Free to use for educational and research purposes with attribution.

    Important Notes

    • This dataset is fully synthetic — it contains no personal or sensitive information.
    • Ideal for learners, educators, and researchers looking to practice analytics and machine learning in a realistic e-commerce context.
  13. NYC Yellow Taxi Trip Data

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Elemento (2021). NYC Yellow Taxi Trip Data [Dataset]. https://www.kaggle.com/datasets/elemento/nyc-yellow-taxi-trip-data
    Explore at:
    zip(1915626894 bytes)Available download formats
    Dataset updated
    Dec 9, 2021
    Authors
    Elemento
    License

    https://www.usa.gov/government-works/https://www.usa.gov/government-works/

    Area covered
    New York
    Description

    Context

    New York City (NYC) Taxi & Limousine Commission (TLC) keeps data from all its cabs, and it is freely available to download from its official website. The TLC primarily keeps and manages data for four different types of vehicles:

    • Yellow Taxi (Yellow Medallion Taxicabs): The famous NYC yellow taxis that provide transportation exclusively through street hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand; pickups are not pre-arranged.
    • Green Taxi (Street Hail Livery): The SHL program allows livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.
    • For-Hire Vehicles (FHVs): FHV transportation is accessed by pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.

    Complimentary Kernel

    • I have made a Kernel especially for this dataset, which uses Clustering, Regression, and Time-Series techniques for this dataset. You can check it out here.

    Important Points

    • This dataset covers only Yellow Taxi data, for the months of January 2015 and January–March 2016.
    • If you go to the NYC TLC website and download any of the CSV files, you will find them in a different format. This is because the TLC regularly adds more data and updates the existing files.
    • One key change is that instead of providing pickup and dropoff coordinates, the TLC now divides NYC into indexed regions and provides those region indices in the CSV files.
    • For this reason I built this dataset from the previous version of the CSV files, which lets me practice clustering alongside time-series techniques.
    • If you want to leave out the clustering part, just go to the TLC website and download the new CSV files.

    Attributes

    ...

    | Field Name | Description |
    | --- | --- |
    | VendorID | A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies; 2 = VeriFone Inc. |
    | tpep_pickup_datetime | The date and time when the meter was engaged. |
    | tpep_dropoff_datetime | The date and time when the meter was disengaged. |
    | Passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
    | Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
    | Pickup_longitude | Longitude where the meter was engaged. |
    | Pickup_latitude | Latitude where the meter was engaged. |
    | RateCodeID | The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride |
    | Store_and_fwd_flag | Indicates whether the trip record was held in vehicle memory before being sent to the vendor ("store and forward") because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip |
    | Dropoff_longitude | Longitude where the meter was disengaged. |
    | Dropoff_latitude | Latitude where the meter was disengaged. |
    | Payment_type | A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip |
    | Fare_amount | The time-and-distance fare calculated by the meter. |
    | Extra | Miscellaneous extras and surcharges. Currently this only includes the $0.50 and $1 rush hour and overnight charges. |
    | MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
    | Improvement_surcharge | $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
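As an illustration of the fields above, a minimal sketch computing trip duration, average speed, and a payment label from one synthetic record (the values are invented; real files may differ in column capitalisation across years):

```python
from datetime import datetime

# One synthetic trip record using the documented field names.
trip = {
    "tpep_pickup_datetime": "2015-01-15 19:05:39",
    "tpep_dropoff_datetime": "2015-01-15 19:23:42",
    "Trip_distance": "1.59",
    "Fare_amount": "12.0",
    "Payment_type": "1",
}

fmt = "%Y-%m-%d %H:%M:%S"
pickup = datetime.strptime(trip["tpep_pickup_datetime"], fmt)
dropoff = datetime.strptime(trip["tpep_dropoff_datetime"], fmt)

duration_min = (dropoff - pickup).total_seconds() / 60   # meter-on time in minutes
speed_mph = float(trip["Trip_distance"]) / (duration_min / 60)

# Decode Payment_type using the mapping from the attribute table.
PAYMENT_LABELS = {1: "Credit card", 2: "Cash", 3: "No charge",
                  4: "Dispute", 5: "Unknown", 6: "Voided trip"}
label = PAYMENT_LABELS[int(trip["Payment_type"])]
print(round(duration_min, 2), round(speed_mph, 2), label)
```

Derived fields like duration and speed are also useful for filtering out implausible rows (zero-minute trips, impossible speeds) before clustering or forecasting.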
  14. Health Outcomes and Socioeconomic Factors

    • kaggle.com
    zip
    Updated Dec 3, 2022
    Cite
    The Devastator (2022). Health Outcomes and Socioeconomic Factors [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-trends-in-health-outcomes-and-socioec/code
    Explore at:
    zip(355475 bytes)Available download formats
    Dataset updated
    Dec 3, 2022
    Authors
    The Devastator
    Description

    Health Outcomes and Socioeconomic Factors

    A Study of US County Data

    By Data Exercises [source]

    About this dataset

    This dataset contains a wealth of health-related information and socio-economic data aggregated from multiple sources such as the American Community Survey, clinicaltrials.gov, and cancer.gov, covering a variety of US counties. Your task is to use this collection of data to build an Ordinary Least Squares (OLS) regression model that predicts the target death rate in each county. The model should incorporate variables related to population size, health insurance coverage, educational attainment levels, median incomes, and poverty rates. Additionally, you will need to assess linearity between your model parameters; measure serial independence among errors; test for heteroskedasticity; evaluate normality of the residual distribution; identify outliers and missing values and decide how categorical variables are handled; compare models using k=10 cross-validation within linear regressions; and assess multicollinearity among model parameters. Examine your results using statistics such as R-squared and Root Mean Square Error (RMSE), and interpret what your analysis implies about health outcomes and their demographic and geographic correlates across the United States.


    How to use the dataset

    This dataset provides data on health outcomes, demographics, and socio-economic factors for various US counties from 2010-2016. It can be used to uncover trends in health outcomes and socioeconomic factors across different counties in the US over a six year period.

    The dataset contains a variety of information, including: statefips (a two-digit code identifying the state); countyfips (a three-digit code identifying the county); average household size; average annual count of cancer cases; average deaths per year; target death rate; median household income; population estimate for 2015; poverty percent; per capita and binned income; demographic information such as the median age of the male and female population; the percentage of married households; educational attainment (adults with no high school diploma, with a high school diploma, with some college, and bachelor's degree holders among adults over 25); employed and unemployed persons 16 and over; private and public health coverage (alone or combined); the percentages of white, black, Asian, and other races; and the birth rate.

    Using this dataset you can build a multivariate ordinary least squares regression model to predict "target_deathrate". You will also need to implement k-fold (k=10) cross-validation to best select your model parameters. Model diagnostics should be performed to assess linearity, serial independence, heteroskedasticity, normality, multicollinearity, and so on, while outliers, missing values, and categorical variables will also affect your model selection process. Finally, it is important to interpret the resulting models in context, accounting for outliers, missing values, demographic changes, and other factors, before arriving at a meaningful conclusion about trends in health outcomes and socioeconomic factors within this dataset.
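The workflow described above (OLS scored by RMSE under k=10 cross-validation) can be sketched on toy data. The column names and coefficients below are invented stand-ins, not from the real file; only `target_deathrate` is named in the description:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the county table; med_income and poverty_pct are
# hypothetical predictors, target_deathrate is the documented target.
n = 200
med_income = rng.normal(50_000, 12_000, n)
poverty_pct = rng.normal(15, 5, n)
target_deathrate = 250 - 0.001 * med_income + 2.0 * poverty_pct + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), med_income, poverty_pct])  # intercept + predictors
y = target_deathrate

# k=10 cross-validated OLS: fit on 9 folds, score RMSE on the held-out fold.
k = 10
folds = np.array_split(rng.permutation(n), k)
rmses = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    resid = y[test_idx] - X[test_idx] @ beta
    rmses.append(np.sqrt(np.mean(resid ** 2)))

print(f"mean CV RMSE: {np.mean(rmses):.2f}")
```

The same loop extends to comparing candidate models: the specification with the lowest mean cross-validated RMSE wins, and the full-sample fit is then examined with the residual diagnostics listed above.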

    Research Ideas

    • Analysis of factors influencing target deathrates in different US counties.
    • Prediction of the effects of varying poverty levels on health outcomes in different US counties.
    • In-depth analysis of how various socio-economic factors (e.g., median income, educational attainment, etc.) contribute to overall public health outcomes in US counties

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: Dataset copyright by authors.

    • You are free to:
      • Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
      • Adapt - remix, transform, and build upon the material for any purpose, even commercially.
    • You must:
      • Give appropriate credit - provide a link to the license, and indicate if changes were made.
      • ShareAlike - distribute your contributions under the same license as the original.
    • ...

  15. Student Performance and Clustering Dataset

    • kaggle.com
    zip
    Updated Oct 24, 2025
    Cite
    Muhammad Khubaib Ahmad (2025). Student Performance and Clustering Dataset [Dataset]. https://www.kaggle.com/datasets/muhammadkhubaibahmad/student-performance-and-clustering-dataset
    Explore at:
    zip(7906 bytes)Available download formats
    Dataset updated
    Oct 24, 2025
    Authors
    Muhammad Khubaib Ahmad
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Description:

    This dataset contains performance, attendance, and participation metrics of 300 students, intended for clustering, exploratory data analysis (EDA), and educational analytics. It can be used to explore relationships between quizzes, exams, GPA, attendance, lab sessions, and other academic indicators.

    This dataset is ideal for unsupervised learning exercises, clustering students based on performance patterns, or for demonstrating educational analytics workflows.

    Note: This is a small dataset (300 rows) and is not suitable for training large-scale supervised models.

    File Information:

    File Name: student_performance.csv
    Format: CSV (Comma-Separated Values)
    Rows: 300
    Columns: 16 features + optional identifier columns

    Column Details:

    | Column Name | Type | Description |
    | --- | --- | --- |
    | student_id | int64 | Unique student identifier |
    | name | object | Student name (should be anonymized before use) |
    | age | int64 | Age of the student (years) |
    | gender | object | Gender of the student |
    | quiz1_marks | float64 | Marks obtained in Quiz 1 (0–10) |
    | quiz2_marks | float64 | Marks obtained in Quiz 2 (0–10) |
    | quiz3_marks | float64 | Marks obtained in Quiz 3 (0–10) |
    | total_assignments | int64 | Total number of assignments assigned |
    | assignments_submitted | float64 | Number of assignments submitted (NaN in current dataset) |
    | midterm_marks | float64 | Marks obtained in midterm exam (0–30) |
    | final_marks | float64 | Marks obtained in final exam (0–50) |
    | previous_gpa | float64 | GPA from previous semester (0–4 scale) |
    | total_lectures | int64 | Total number of lectures scheduled |
    | lectures_attended | int64 | Number of lectures attended |
    | total_lab_sessions | int64 | Total lab sessions assigned |
    | labs_attended | int64 | Number of lab sessions attended |

    Suggested Usage:

    • Clustering: Group students based on performance metrics, attendance, and GPA trends.
    • Exploratory Data Analysis (EDA): Analyze correlations between attendance, quizzes, midterm/final scores, and GPA.
    • Educational Analytics: Derive participation rates, average scores, and performance trends.
    • Feature Engineering: Compute additional metrics like average quiz score, total participation, or engagement ratios.

    Preprocessing Notes:
    • Drop or impute assignments_submitted if using for ML.
    • Anonymize name to maintain privacy.
    • Categorical variable gender can be label encoded or one-hot encoded if needed.
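The preprocessing notes above (impute `assignments_submitted`, then cluster) can be sketched end-to-end. The data below is randomly generated, not from the file, and the from-scratch k-means stands in for whatever clustering library you prefer:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for student_performance.csv using a few documented columns.
n = 300
quiz_avg = np.clip(rng.normal(6, 2, n), 0, 10)          # mean of quiz1..quiz3 marks
attendance = np.clip(rng.normal(0.8, 0.15, n), 0, 1)    # lectures_attended / total_lectures
assignments = np.where(rng.random(n) < 0.2, np.nan,
                       rng.integers(0, 11, n).astype(float))  # with gaps, as in the file

# 1) Impute missing assignments_submitted with the column mean.
assignments = np.where(np.isnan(assignments), np.nanmean(assignments), assignments)

# 2) Standardize features, then run a tiny from-scratch k-means (k=3).
X = np.column_stack([quiz_avg, attendance, assignments])
X = (X - X.mean(axis=0)) / X.std(axis=0)

k = 3
centers = X[rng.choice(n, k, replace=False)]
for _ in range(20):
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])

print(np.bincount(labels, minlength=k))  # students per cluster
```

Standardizing first matters here because quiz marks, attendance ratios, and assignment counts live on very different scales.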

    License: CC BY 4.0 – Free to use, share, and adapt with proper attribution.

    Citation: Muhammad Khubaib Ahmad, "Student Performance and Clustering Dataset", 2025, Kaggle. DOI: https://doi.org/10.34740/kaggle/dsv/13489035

  16. Call Centre Queue Simulation

    • kaggle.com
    zip
    Updated Sep 20, 2022
    Cite
    Donovan Bangs (2022). Call Centre Queue Simulation [Dataset]. https://www.kaggle.com/datasets/donovanbangs/call-centre-queue-simulation
    Explore at:
    zip(841475 bytes)Available download formats
    Dataset updated
    Sep 20, 2022
    Authors
    Donovan Bangs
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Call Centre Queue Simulation

    A simulated call centre dataset and notebook, designed to be used as a classroom / tutorial dataset for Business and Operations Analytics.

    This notebook details the creation of simulated call centre logs over the course of one year. For this dataset we are imagining a business whose lines are open from 8:00am to 6:00pm, Monday to Friday. Four agents are on duty at any given time and each call takes an average of 5 minutes to resolve.

    The call centre manager is required to meet a performance target: 90% of calls must be answered within 1 minute. Lately, the performance has slipped. As the data analytics expert, you have been brought in to analyze their performance and make recommendations to return the centre back to its target.

    The dataset records timestamps for when a call was placed, when it was answered, and when the call was completed. The total waiting and service times are calculated, as well as a logical for whether the call was answered within the performance standard.

    Discrete-Event Simulation

    Discrete-Event Simulation allows us to model real calling behaviour with a few simple variables.

    • Arrival Rate
    • Service Rate
    • Number of Agents

    The simulations in this dataset are performed using the package simmer (Ucar et al., 2019). I encourage you to visit their website for complete details and fantastic tutorials on Discrete-Event Simulation.

    Ucar I, Smeets B, Azcorra A (2019). “simmer: Discrete-Event Simulation for R.” Journal of Statistical Software, 90(2), 1–30.

    For source code and simulation details, view the cross-posted GitHub notebook and Shiny app.
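The same scenario can be sketched as a minimal M/M/c-style simulation in Python (the original notebook uses R's simmer; the 1.5-minute mean inter-arrival time below is an assumption for illustration, not a value from the dataset):

```python
import heapq
import random

random.seed(1)

# 4 agents, 5-minute mean service time (as described above),
# exponential inter-arrival times with an assumed 1.5-minute mean.
AGENTS, SERVICE_MEAN, ARRIVAL_MEAN, N_CALLS = 4, 5.0, 1.5, 5000

t = 0.0
agent_free = [0.0] * AGENTS          # time at which each agent next becomes free
heapq.heapify(agent_free)
waits = []
for _ in range(N_CALLS):
    t += random.expovariate(1 / ARRIVAL_MEAN)    # call placed
    free_at = heapq.heappop(agent_free)           # soonest-available agent
    start = max(t, free_at)                       # call answered
    waits.append(start - t)
    heapq.heappush(agent_free, start + random.expovariate(1 / SERVICE_MEAN))

within_1min = sum(w <= 1.0 for w in waits) / len(waits)
print(f"answered within 1 min: {within_1min:.1%}")
```

Varying `AGENTS` or `ARRIVAL_MEAN` and re-running shows how sensitive the 90%-within-1-minute target is to staffing and call volume, which is exactly the kind of recommendation the scenario asks for.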

  17. Synthetic Sri Lanka Fuel Prices 2010–2025

    • kaggle.com
    Updated Aug 9, 2025
    Cite
    Dewmi Nimnaadi (2025). Synthetic Sri Lanka Fuel Prices 2010–2025 [Dataset]. https://www.kaggle.com/datasets/dewminimnaadi/synthetic-sri-lanka-fuel-prices-20102025
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 9, 2025
    Dataset provided by
    Kaggle
    Authors
    Dewmi Nimnaadi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Sri Lanka
    Description

    This dataset contains synthetically generated monthly fuel price data for Sri Lanka from January 2010 to August 2025, covering five major fuel types:

    • Petrol 92
    • Petrol 95
    • Diesel Auto
    • Diesel Super
    • Kerosene

    Prices are not real — they are created using a statistical simulation model that incorporates realistic market behaviors and macroeconomic effects such as:

    • Global oil price fluctuations
    • Exchange rate changes
    • Policy revisions and tax adjustments
    • Seasonal demand shifts
    • Crisis-related volatility (e.g., synthetic 2020 pandemic dip, 2022 FX/debt crisis spike)

    The dataset is designed for educational, research, and data science practice purposes — ideal for time-series forecasting, trend visualization, and policy simulation exercises.

    How to Use

    You can use this dataset for:

    • Time-Series Forecasting – Build ARIMA, Prophet, LSTM, or XGBoost models to predict future fuel prices.
    • 📈 Policy Impact Analysis – Simulate how events affect fuel prices.
    • 📊 Data Visualization – Create dashboards showing trends by fuel type.
    • 🧪 Feature Engineering – Generate lag features, moving averages, seasonal indicators, and volatility measures.
    • 🔍 Categorical Analysis – Study correlations between change_reason and price changes.

    Note: Missing values are included in certain months for some fuel types to simulate real-world data gaps. This allows testing of imputation and data cleaning techniques.

    Data Dictionary

    | Column | Description | Type / Values | Example |
    | --- | --- | --- | --- |
    | date | Month start date (YYYY-MM-DD) | Date | 2022-07-01 |
    | fuel_type | Fuel type | Petrol_92, Petrol_95, Diesel_Auto, Diesel_Super, Kerosene | Petrol_92 |
    | price_lkr_per_litre | Synthetic retail price per litre (LKR) | Integer, may have missing values | 470 |
    | change_reason | Main driver of price change | global_oil, fx_rate, policy_revision, tax_adjustment, seasonal | policy_revision |
    | notes | Additional context | String | Synthetic monthly price index; not real market data. |

    Example Uses

    • Forecast price_lkr_per_litre using historical patterns.
    • Compare volatility between fuel types.
    • Visualize the synthetic 2022 “crisis spike” and its recovery trend.
    • Apply missing value imputation methods for price gaps.
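The imputation and feature-engineering uses above can be sketched with pandas on the documented columns; the prices below are illustrative, not taken from the file:

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame with the documented columns.
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=6, freq="MS"),
    "fuel_type": "Petrol_92",
    "price_lkr_per_litre": [254, 254, np.nan, 338, 420, 470],
})

# 1) Fill the intentional gaps, e.g. by interpolating within each fuel type.
df["price_filled"] = (
    df.groupby("fuel_type")["price_lkr_per_litre"]
      .transform(lambda s: s.interpolate())
)

# 2) Lag feature and month-over-month change for forecasting models.
df["price_lag1"] = df.groupby("fuel_type")["price_filled"].shift(1)
df["mom_change"] = df["price_filled"] - df["price_lag1"]

print(df[["date", "price_filled", "mom_change"]])
```

Grouping by `fuel_type` before interpolating and lagging keeps each fuel's series independent, which matters once all five fuel types are stacked in one frame.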

    Important Notes

    • This dataset is entirely synthetic — it is not sourced from CEYPETCO, Lanka IOC, or any real provider.
    • It is intended only for learning and research purposes.
    • Missing values are intentional to mimic incomplete real-world datasets.
    • Price patterns are designed to be realistic but do not reflect real historical prices.

    💬 Feel free to discuss anything related to this dataset in the comments — suggestions, ideas, or ways to improve it are welcome!

  18. Bike Store Relational Database | SQL

    • kaggle.com
    zip
    Updated Aug 21, 2023
    Cite
    Dillon Myrick (2023). Bike Store Relational Database | SQL [Dataset]. https://www.kaggle.com/datasets/dillonmyrick/bike-store-sample-database
    Explore at:
    zip(94412 bytes)Available download formats
    Dataset updated
    Aug 21, 2023
    Authors
    Dillon Myrick
    Description

    This is the sample database from sqlservertutorial.net. This is a great dataset for learning SQL and practicing querying relational databases.

    Database Diagram:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4146319%2Fc5838eb006bab3938ad94de02f58c6c1%2FSQL-Server-Sample-Database.png?generation=1692609884383007&alt=media

    Terms of Use

    The sample database is copyrighted and cannot be used for commercial purposes, including but not limited to: selling it, or including it in paid courses.

  19. Football DataSet +96k matches (18 leagues)

    • kaggle.com
    zip
    Updated May 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sebastian Gębala (2023). Football DataSet +96k matches (18 leagues) [Dataset]. https://www.kaggle.com/datasets/bastekforever/complete-football-data-89000-matches-18-leagues
    Explore at:
    zip(9816722 bytes)Available download formats
    Dataset updated
    May 2, 2023
    Authors
    Sebastian Gębala
    Description

    The ultimate Football database for data analysis and machine learning

    What you get:

    +96,000 matches with a detailed minute-by-minute history of each game, including player names (goals, yellow/red cards, penalties, VAR, missed penalties, etc.) in the INC factor. Season 2021-2022 is included.

    18 European Leagues from 10 Countries with their lead championship:

    • premier-league - 7600 matches (seasons 2002-2022)
    • laliga - 7220 matches (seasons 2003-2022)
    • serie-a - 7150 matches (seasons 2003-2022)
    • ligue-1 - 6757 matches (seasons 2004-2022)
    • championship - 6684 matches (seasons 2010-2022)
    • league-one - 6440 matches (seasons 2010-2022)
    • bundesliga - 5838 matches (seasons 2003-2022)
    • league-two - 6015 matches (seasons 2011-2022)
    • eredivisie - 5776 matches (seasons 2004-2022)
    • laliga2 - 5519 matches (seasons 2010-2022)
    • serie-b - 5286 matches (seasons 2010-2022)
    • ligue-2 - 4470 matches (seasons 2010-2022)
    • super-lig - 3504 matches (seasons 2010-2022)
    • jupiler-league - 3756 matches (seasons 2010-2022)
    • fortuna-1-liga - 3687 matches (seasons 2010-2022)
    • 2-bundesliga - 3503 matches (seasons 2010-2022)
    • liga-portugal - 3414 matches (seasons 2010-2022)
    • pko-bp-ekstraklasa - 3338 matches (seasons 2010-2022)

    Betting odds (including winning odds), statistics, and detailed match events (goal types, possession, corners, crosses, fouls, cards, etc.) for +96,000 matches.

    Why this data?

    You can easily find data about football matches, but it is usually scattered across different websites and, in my opinion, lacks well-shaped game events. Therefore the most useful part of this dataset is the INC factor, which is a register of game events minute-by-minute (goals, cards, VARs, missed penalties, etc.) collected in a Python list. Example, Swansea–Reading:

    "INC": [
          "08' Yellow_Away - Griffin A.",
          "12' Yellow_Away - Khizanishvili Z.",
          "12' Yellow_Home - Borini F.",
          "21' Goal_Home - Penalty Sinclair S.(Penalty )",
          "22' Goal_Home - Sinclair S.(Dobbie S.)",
          "39' Yellow_Away - McAnuff J.",
          "40' Goal_Home - Dobbie S.",
          "46' Red_Card_Away - Tabb J.",
          "49' Own_Away - Allen J.()",
          "54' Yellow_Home - Allen J.",
          "57' Goal_Away - Mills M.(McAnuff J.)",
          "80' Goal_Home - Sinclair S. (Penalty)",
          "82' Yellow_Home - Gower M."
        ],
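The INC strings can be parsed into structured events with a small regular expression. The pattern below is inferred from the example above and may need adjusting for other event spellings elsewhere in the data:

```python
import re

# Sample INC lines taken from the Swansea–Reading example above.
inc = [
    "08' Yellow_Away - Griffin A.",
    "21' Goal_Home - Penalty Sinclair S.(Penalty )",
    "46' Red_Card_Away - Tabb J.",
    "57' Goal_Away - Mills M.(McAnuff J.)",
]

# minute' EventType_(Home|Away) - detail
EVENT_RE = re.compile(r"^(\d+)' (\w+?)_(Home|Away) - (.+)$")

events = []
for line in inc:
    m = EVENT_RE.match(line)
    if m:
        minute, event, side, detail = m.groups()
        events.append({"minute": int(minute), "event": event,
                       "side": side, "detail": detail.strip()})

goals_home = sum(e["event"] == "Goal" and e["side"] == "Home" for e in events)
print(events[0])
print("home goals:", goals_home)
```

Once parsed like this, the events can be aggregated into per-match features (goals by half, card counts, etc.) for modeling.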
    

    This data is scraped from one of the livescore web page providers. I own a program written in Python which can scrape data from any league around the world (but it takes time, and the program itself needs constant updating as the providers change their source code).

    Locally my dataset is larger because it contains +100 factors, i.e. it contains information about previous games with full details about those games, and more. I shortened the dataset uploaded to Kaggle to make it simpler and more understandable.

    License

    I must insist that you do not make any commercial use of the data. This dataset is provided for non-commercial use only.

    Cooperation

    sebastian.gebala@gmail.com

  20. NHL Draft Hockey Player Data (1963 - 2022)

    • kaggle.com
    zip
    Updated Aug 3, 2022
    Cite
    Matt OP (2022). NHL Draft Hockey Player Data (1963 - 2022) [Dataset]. https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022
    Explore at:
    zip(350802 bytes)Available download formats
    Dataset updated
    Aug 3, 2022
    Authors
    Matt OP
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset contains every player drafted in the NHL Draft from (1963 - 2022).

    The data was collected from Sports Reference then cleaned for data analysis.

    Tabular data includes:

    • year: Year of draft
    • overall_pick: Overall pick player was drafted
    • team: Team player drafted to
    • player: Player drafted
    • nationality: Nationality of player drafted
    • position: Player position
    • age: Player age
    • to_year: Year draft pick played to
    • amateur_team: Amateur team drafted from
    • games_played: Total games played by player (non-goalie)
    • goals: Total goals
    • assists: Total assists
    • points: Total points
    • plus_minus: Plus minus of player
    • penalties_minutes: Penalties in minutes
    • goalie_games_played: Goalie games played
    • goalie_wins
    • goalie_losses
    • goalie_ties_overtime: Ties plus overtime/shootout losses
    • save_percentage
    • goals_against_average
    • point_shares
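A simple derived stat from the columns above, e.g. points per game with a guard for goalies and players with no games (the example row is invented, not from the file):

```python
# Hypothetical skater row using the documented column names.
player = {
    "year": 1990, "overall_pick": 1, "player": "Example Skater",
    "games_played": 820, "goals": 310, "assists": 450, "points": 760,
}

def points_per_game(row):
    """points / games_played; returns None for goalies or zero games."""
    gp = row.get("games_played") or 0
    return row.get("points", 0) / gp if gp else None

ppg = points_per_game(player)
print(round(ppg, 3))  # 0.927
```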

🎓 365DS Practice Exams • People Analytics Dataset

People Analytics Data used in « 365 Data Science Practice Exams • SQL »


Acknowledgements & Data Origin

This dataset has a rich lineage, originating from academic research and evolving through various formats to its current relational structure:

Original Authors

The foundational dataset was authored by Prof. Dr. Fusheng Wang 🔗 (then a PhD student at the University of California, Los Angeles - UCLA) and his advisor, Prof. Dr. Carlo Zaniolo 🔗 (UCLA). This work is primarily described in their paper:

Relational Conversion

It was originally distributed as an .xml file. Giuseppe Maxia (known as @datacharmer on GitHub🔗 and LinkedIn🔗, as well as here on Kaggle) converted it into its relational form and subsequently distributed it as a .sql file, making it accessible for relational database use.

Kaggle Upload

This .sql version was then loaded to Kaggle as the « Employees Dataset » by Mirza Huzaifa🔗 on February 5th, 2023.
