License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset has been uploaded to Kaggle to accompany the 365 Data Science • Practice Exams: SQL curriculum, a set of free resources designed to help test and elevate data science skills. It is a synthetic, relational collection of data structured to simulate common employee and organizational scenarios, ideal for practicing SQL queries and data analysis skills in a People Analytics context.
The dataset contains the following tables:
departments.csv: List of all company departments.
dept_emp.csv: Historical and current assignments of employees to departments.
dept_manager.csv: Historical and current assignments of employees as department managers.
employees.csv: Core employee demographic information.
employees.db: A SQLite database containing all the relational tables from the CSV files.
salaries.csv: Historical salary records for employees.
titles.csv: Historical job titles held by employees.
The dataset supports both general data analytics and time series analysis applications.
A practical application is presented in the 🎓 365DS Practice Exams • SQL notebook, which covers in detail the answers to the questions of SQL Practice Exams 1, 2, and 3 on the 365DS platform, especially illustrating the usage and value of SQL procedures and functions.
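As a minimal, hedged illustration of querying the bundled SQLite database (separate from the notebook above; column names such as emp_no, dept_no, and salary are assumptions based on the classic employees schema this dataset derives from):

~~~python
import sqlite3

# Average salary per department; table/column names are assumptions
# drawn from the CSV file names and the classic employees schema.
conn = sqlite3.connect("employees.db")
query = """
SELECT d.dept_name,
       ROUND(AVG(s.salary), 2) AS avg_salary
FROM salaries AS s
JOIN dept_emp AS de ON de.emp_no = s.emp_no
JOIN departments AS d ON d.dept_no = de.dept_no
GROUP BY d.dept_name
ORDER BY avg_salary DESC;
"""
for dept_name, avg_salary in conn.execute(query):
    print(dept_name, avg_salary)
conn.close()
~~~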
This dataset has a rich lineage, originating from academic research and evolving through various formats to its current relational structure:
The foundational dataset was authored by Prof. Dr. Fusheng Wang 🔗 (then a PhD student at the University of California, Los Angeles - UCLA) and his advisor, Prof. Dr. Carlo Zaniolo 🔗 (UCLA). This work is primarily described in their paper:
It was originally distributed as an .xml file. Giuseppe Maxia (known as @datacharmer on GitHub🔗 and LinkedIn🔗, as well as here on Kaggle) converted it into its relational form and subsequently distributed it as a .sql file, making it accessible for relational database use.
This .sql version was then loaded to Kaggle as the « Employees Dataset » by Mirza Huzaifa🔗 on February 5th, 2023.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.
dirty_cafe_sales.csv

| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich |
| Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN |
| Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00 |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00 |
| Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway |
| Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01 |
Missing Values: Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values: Some entries contain "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency: The dataset includes the following menu items with their respective prices:
| Item | Price($) |
|---|---|
| Coffee | 2 |
| Tea | 1.5 |
| Sandwich | 4 |
| Salad | 5 |
| Cake | 3 |
| Cookie | 1 |
| Smoothie | 4 |
| Juice | 3 |
This dataset is suitable for:
- Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
- Exploring EDA techniques like visualizations and summary statistics.
- Performing feature engineering for machine learning workflows.
To clean this dataset, consider the following steps:
1. Handle Missing Values:
   - Fill missing numeric values with the median or mean.
   - Replace missing categorical values with the mode or "Unknown."
2. Handle Invalid Values:
   - Replace "ERROR" and "UNKNOWN" with NaN or appropriate values.
3. Date Consistency:
   - Parse Transaction Date and handle missing or incorrect dates.
4. Feature Engineering:
   - Create new columns, such as Day of the Week or Transaction Month, for further analysis.
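A hedged pandas sketch of steps 1 through 4 (the file name comes from the table above; the exact set of invalid markers in the data may differ):

~~~python
import numpy as np
import pandas as pd

df = pd.read_csv("dirty_cafe_sales.csv")

# Step 2 first: normalize invalid markers to NaN so they count as missing.
df = df.replace(["ERROR", "UNKNOWN", "None", ""], np.nan)

# Step 1: numeric columns get the median; categorical columns get "Unknown".
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].median())
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].fillna("Unknown")

# Step 3: parse dates; unparseable entries become NaT.
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

# Step 4: simple date-derived features.
df["Day of the Week"] = df["Transaction Date"].dt.day_name()
df["Transaction Month"] = df["Transaction Date"].dt.month
~~~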
This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.
Hi. This is my data analysis project and also my first try at using R in my work. It is the capstone project for the Google Data Analytics Certificate course offered on Coursera (https://www.coursera.org/professional-certificates/google-data-analytics). It is an operational data analysis of data from a health monitoring device. For the detailed background story, please check the PDF file (Case 02.pdf) for reference.
In this case study, I use personal health tracker data from Fitbit to evaluate how the health tracker device is used, and then determine whether there are any trends or patterns.
My data analysis will focus on two areas: exercise activity and sleeping habits. The exercise activity part studies the relationship between activity type and calories consumed, while the sleeping habit part identifies patterns in how users sleep. In this analysis, I will also try to use some linear regression models, so that the data can be explained in a quantitative way and predictions become easier.
I understand that I am new to data analysis and my skills and code are very beginner level. But I am working hard to learn more in both R and the data science field. If you have any ideas or feedback, please feel free to comment.
Stanley Cheng 2021-10-07
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
Market basket analysis with the Apriori algorithm
The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a dataset of a retailer; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow in the industry and provide customers with itemset suggestions, so we will be able to increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.
Association Rule mining is most used when you are planning to find associations between different objects in a set. It works when you are planning to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule bought Computer Mouse => bought Mouse Mat:
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support / P(Computer Mouse) = 0.08/0.10 = 0.8
- lift = confidence / P(Mouse Mat) = 0.8/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Number of Attributes: 7
First, we need to load the required libraries.
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
Next, we will clean our data frame and remove missing values.
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
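The original walkthrough continues in R. As an analogous, hedged sketch in Python, mlxtend expresses the same pipeline of encoding transactions and mining rules (the toy transactions and thresholds are illustrative, not the real data):

~~~python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Toy stand-in for the invoice-grouped items of the real dataset.
transactions = [
    ["Computer Mouse", "Mouse Mat"],
    ["Computer Mouse", "Keyboard"],
    ["Mouse Mat", "Keyboard"],
]

# One-hot encode transactions into a boolean basket matrix.
te = TransactionEncoder()
basket = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Mine frequent itemsets, then derive association rules from them.
frequent = apriori(basket, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
~~~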
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
What is the Breast Cancer Dataset?
Breast cancer is the most common cancer amongst women in the world. It accounts for 25% of all cancer cases and affected over 2.1 Million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area.
How to use this dataset
The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous). We ask you to complete the analysis of classifying these tumors using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) Dataset.
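A minimal sketch of the suggested SVM classification, using scikit-learn's built-in copy of the Breast Cancer Wisconsin (Diagnostic) data (the Kaggle CSV could be loaded with pandas instead; the hyperparameters here are illustrative):

~~~python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Feature scaling matters for SVMs; an RBF kernel is a common default.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
~~~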
Acknowledgments
When we use this dataset in our research, we credit the authors as follows:
License: CC BY 4.0.
This data set is taken from https://data.world/health/breast-cancer-wisconsin by the Donor: Nick Street and the Source: UCI - Machine Learning Repository.
The main idea behind uploading this dataset is to practice data analysis with my students, as I work at a college and want my students to apply what we study to a big dataset. It may not be up to date, and I mention the collection years, but it is a good resource of data for practice.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a synthetic yet realistic E-commerce retail dataset generated programmatically using Python (Faker + NumPy + Pandas).
It is designed to closely mimic real-world online shopping behavior, user patterns, product interactions, seasonal trends, and marketplace events.
Machine Learning & Deep Learning
Recommender Systems
Customer Segmentation
Sales Forecasting
A/B Testing
E-commerce Behaviour Analysis
Data Cleaning / Feature Engineering Practice
SQL practice
The dataset contains 6 CSV files:

| File | Rows | Description |
|---|---|---|
| users.csv | ~10,000 | User profiles, demographics & signup info |
| products.csv | ~2,000 | Product catalog with rating and pricing |
| orders.csv | ~20,000 | Order-level transactions |
| order_items.csv | ~60,000 | Items purchased per order |
| reviews.csv | ~15,000 | Customer-written product reviews |
| events.csv | ~80,000 | User event logs: view, cart, wishlist, purchase |
1. Users (users.csv)

| Column | Description |
|---|---|
| user_id | Unique user identifier |
| name | Full customer name |
| email | Email (synthetic, no real emails) |
| gender | Male / Female / Other |
| city | City of residence |
| signup_date | Account creation date |

2. Products (products.csv)

| Column | Description |
|---|---|
| product_id | Unique product identifier |
| product_name | Product title |
| category | Electronics, Clothing, Beauty, Home, Sports, etc. |
| price | Actual selling price |
| rating | Average product rating |

3. Orders (orders.csv)

| Column | Description |
|---|---|
| order_id | Unique order identifier |
| user_id | User who placed the order |
| order_date | Timestamp of the order |
| order_status | Completed / Cancelled / Returned |
| total_amount | Total order value |

4. Order Items (order_items.csv)

| Column | Description |
|---|---|
| order_item_id | Unique identifier |
| order_id | Associated order |
| product_id | Purchased product |
| quantity | Quantity purchased |
| item_price | Price per unit |

5. Reviews (reviews.csv)

| Column | Description |
|---|---|
| review_id | Unique review identifier |
| user_id | User who submitted review |
| product_id | Reviewed product |
| rating | 1–5 star rating |
| review_text | Short synthetic review |
| review_date | Submission date |

6. Events (events.csv)

| Column | Description |
|---|---|
| event_id | Unique event identifier |
| user_id | User performing event |
| product_id | Viewed/added/purchased product |
| event_type | view/cart/wishlist/purchase |
| event_timestamp | Timestamp of event |
Customer churn prediction
Review sentiment analysis (NLP)
Recommendation engines
Price optimization models
Demand forecasting (Time-series)
Market basket analysis
RFM segmentation
Cohort analysis
Funnel conversion tracking
A/B testing simulations
Joins
Window functions
Aggregations
CTE-based funnels
Complex queries
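For example, the events table supports a CTE-based funnel. A hedged sketch that loads the CSV into SQLite (the event_type values follow the schema above):

~~~python
import pandas as pd
import sqlite3

conn = sqlite3.connect(":memory:")
pd.read_csv("events.csv").to_sql("events", conn, index=False)

# view -> cart -> purchase funnel, computed per user via a CTE.
query = """
WITH funnel AS (
    SELECT user_id,
           MAX(event_type = 'view')     AS viewed,
           MAX(event_type = 'cart')     AS carted,
           MAX(event_type = 'purchase') AS purchased
    FROM events
    GROUP BY user_id
)
SELECT SUM(viewed) AS viewers,
       SUM(carted) AS carters,
       SUM(purchased) AS purchasers
FROM funnel;
"""
print(pd.read_sql(query, conn))
~~~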
Faker for realistic user and review generation
NumPy for probability-based event modeling
Pandas for data processing
demand variation
user behavior simulation
return/cancel probabilities
seasonal order timestamp distribution
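An illustrative flavor of this generation approach (not the author's actual script; the probabilities and sizes are made up):

~~~python
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
rng = np.random.default_rng(7)

# Faker-style user profiles.
users = pd.DataFrame({
    "user_id": range(1, 101),
    "name": [fake.name() for _ in range(100)],
    "city": [fake.city() for _ in range(100)],
})

# NumPy probability-based event modeling: views dominate, purchases are rare.
events = pd.DataFrame({
    "user_id": rng.integers(1, 101, size=1000),
    "event_type": rng.choice(["view", "cart", "wishlist", "purchase"],
                             size=1000, p=[0.6, 0.2, 0.1, 0.1]),
})
print(events["event_type"].value_counts())
~~~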
The dataset does not include any real personal data.
Everything is generated synthetically.
This dataset is released under CC BY 4.0 — free to use for:
Research
Education
Commercial projects
Kaggle competitions
Machine learning pipelines
Just provide attribution.
Upvote the dataset
Leave a comment
Share your notebooks using it
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains detailed information on all Premier League matches played between the 2021 and 2025 seasons. It includes match dates, times, venues, results, goals scored (gf), goals against (ga), expected goals (xg), possession percentages, attendance figures, team formations, referees, and other relevant statistics. This data can be used for analysis, modeling predictions, or exploring trends in Premier League football.
| Column Name | Description |
|---|---|
| date | The date of the match (format: MM/DD/YYYY) |
| time | The time of the match (in 24-hour format) |
| comp | Competition name (e.g., Premier League) |
| round | Match round or week number |
| day | Day of the week when the match was played |
| venue | Venue where the match took place |
| result | Result of the match (W for Win, D for Draw, L for Loss) |
| gf | Goals For - number of goals scored by the home team |
| ga | Goals Against - number of goals conceded by the home team |
| opponent | Name of the opposing team |
| xg | Expected Goals for the home team |
| xga | Expected Goals Against for the home team |
| poss | Possession percentage |
| attendance | Number of spectators attending the match |
| captain | Captain's name for the home team |
| formation | Formation used by the home team |
| opp formation | Formation used by the opponent |
| referee | Referee officiating the match |
| match report | Link or reference to a detailed match report |
| notes | Additional notes regarding specific matches |
| sh | Total shots taken by the home team |
| sot | Shots on target by the home team |
| dist | Average distance of shots (in meters) |
| fk | Number of free kicks awarded to the home team |
| pk | Number of penalties awarded to the home team |
| pkatt | Number of penalties attempted by the home team |
| team | Name of the home team |
| season | Season during which matches were played |
The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.
This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.
This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.
Cover Photo by: Freepik
Thumbnail by: Clothing icons created by Flat Icons - Flaticon
This dataset is a practical SQL case study designed for learners who are looking to enhance their SQL skills in analyzing sales, products, and marketing data. It contains several SQL queries related to a simulated business database for product sales, marketing expenses, and location data. The database consists of three main tables: Fact, Product, and Location.
Objective of the Case Study: The purpose of this case study is to provide learners with a variety of practical SQL exercises that involve real-world business problems. The queries explore topics such as:
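As a flavor of such exercises, a hypothetical query against the described Fact/Product/Location star schema (all column names here are illustrative assumptions, not the actual schema):

~~~python
import sqlite3

conn = sqlite3.connect("sales_case_study.db")  # assumed file name
query = """
SELECT p.product_name,
       l.region,
       SUM(f.sales_amount)   AS total_sales,
       SUM(f.marketing_cost) AS total_marketing
FROM Fact AS f
JOIN Product  AS p ON p.product_id  = f.product_id
JOIN Location AS l ON l.location_id = f.location_id
GROUP BY p.product_name, l.region
ORDER BY total_sales DESC;
"""
for row in conn.execute(query):
    print(row)
~~~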
The description is a more detailed explanation of the dataset's content, source, and potential use cases. It helps users understand the dataset's relevance and usefulness for their projects. Here's an example description for the NBA players performance dataset: Description Example: "This dataset contains comprehensive performance statistics for NBA players from the 2020-2021 season. It includes player-level data such as points scored, rebounds, assists, field goal percentage, free throw percentage, and more. The data was collected from official NBA records and other reputable sources.
The dataset can be used for various data analysis and machine learning tasks related to NBA player performance. Analysts and researchers can explore player trends, compare individual performances, identify standout players, and investigate correlations between different performance metrics.
Whether you're an NBA enthusiast, a data scientist, or a basketball coach, this dataset provides valuable insights into the statistical aspects of player performance in the 2020-2021 NBA season. It is ideal for data-driven research, building predictive models, and gaining a deeper understanding of player contributions to their teams."
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The E-Commerce Customer Behavior Dataset is a synthetic dataset designed to capture the full spectrum of customer interactions with an online retail platform. Created by Gretel AI for educational and research purposes, it provides a comprehensive view of how customers browse, purchase, and review products. The dataset is ideal for data science practice, machine learning modeling, and exploratory analytics.
Structured list of products purchased, including:
Allows analysis of repeat purchases, product popularity, and category trends.
| Feature | Range / Distribution | Notes |
|---|---|---|
| Age | 24–65 | Mean: 40, Std: 11 |
| Gender | Female 52%, Male 36%, Other 12% | Categorical |
| Location | Most common: City D (24%), City E (12%), Other (64%) | Regional trends |
| Annual Income | $40,000–$100,000 | Mean: $65,800, Std: $16,900 |
| Time on Site | 32.5–486.3 mins | Mean: 233, Std: 109 |
[
{"Date": "2022-03-05", "Category": "Clothing", "Price": 34.99},
{"Date": "2022-02-12", "Category": "Electronics", "Price": 129.99},
{"Date": "2022-01-20", "Category": "Home & Garden", "Price": 29.99}
]
[
{"Timestamp": "2022-03-10T14:30:00Z"},
{"Timestamp": "2022-03-11T09:45:00Z"},
{"Timestamp": "2022-03-12T16:20:00Z"}
]
{
"Review Text": "Excellent product, highly recommend!",
"Rating": 5
}
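The nested records shown above (purchase history, session timestamps, review objects) can be flattened for tabular analysis. A hedged sketch, assuming such a field arrives as a JSON string in a column named "Purchase History" (an assumption about the export format):

~~~python
import json

import pandas as pd

# Hypothetical row standing in for one record of the dataset.
row = {
    "Customer ID": 1,
    "Purchase History": json.dumps([
        {"Date": "2022-03-05", "Category": "Clothing", "Price": 34.99},
        {"Date": "2022-02-12", "Category": "Electronics", "Price": 129.99},
    ]),
}

# One row per purchase: Date, Category, Price, plus the owning customer.
purchases = pd.json_normalize(json.loads(row["Purchase History"]))
purchases["Customer ID"] = row["Customer ID"]
print(purchases)
~~~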
This dataset was synthetically generated using machine learning techniques to simulate realistic customer behavior:
Pattern Recognition: identifying trends and correlations observed in real-world e-commerce datasets.
Synthetic Data Generation: producing data points for all features while preserving realistic relationships.
Controlled Variation: introducing diversity to reflect a wide range of customer behaviors while maintaining logical consistency.
CC BY 4.0 (Attribution 4.0 International): free to use for educational and research purposes with attribution.
License: U.S. Government Works, https://www.usa.gov/government-works/
New York City (NYC) Taxi & Limousine Commission (TLC) keeps data from all its cabs, and it is freely available to download from its official website. You can access it here. Now, the TLC primarily keeps and manages data for 4 different types of vehicles:
- Yellow Taxi (Yellow Medallion Taxicabs): These are the famous NYC yellow taxis that provide transportation exclusively through street hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.
- Green Taxi (Street Hail Livery): The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.
- For-Hire Vehicles (FHVs): FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.
| Field Name | Description |
|---|---|
| VendorID | A code indicating the TPEP provider that provided the record. |
| tpep_pickup_datetime | The date and time when the meter was engaged. |
| tpep_dropoff_datetime | The date and time when the meter was disengaged. |
| Passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
| Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
| Pickup_longitude | Longitude where the meter was engaged. |
| Pickup_latitude | Latitude where the meter was engaged. |
| RateCodeID | The final rate code in effect at the end of the trip. |
| Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip. |
| Dropoff_longitude | Longitude where the meter was disengaged. |
| Dropoff_latitude | Latitude where the meter was disengaged. |
| Payment_type | A numeric code signifying how the passenger paid for the trip. |
| Fare_amount | The time-and-distance fare calculated by the meter. |
| Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
| MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| Improvement_surcharge | $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
By Data Exercises [source]
This dataset contains a wealth of health-related information and socio-economic data aggregated from multiple sources such as the American Community Survey, clinicaltrials.gov, and cancer.gov, covering a variety of US counties. Your task is to use this collection of data to build an Ordinary Least Squares (OLS) regression model that predicts the target death rate in each county. The model should incorporate variables related to population size, health insurance coverage, educational attainment levels, median incomes, and poverty rates. Additionally, you will need to assess linearity among your model parameters; measure serial independence among errors; test for heteroskedasticity; evaluate normality of the residual distribution; identify any outliers or missing values and determine how categorical variables are handled; compare models through k=10 cross-validation within linear regressions; and assess multicollinearity among model parameters. Examine your results using statistical measures such as R-squared values and Root Mean Square Error (RMSE), and interpret the implications your analysis uncovers about health outcomes and their demographic correlates across geographic boundaries throughout the United States.
This dataset provides data on health outcomes, demographics, and socio-economic factors for various US counties from 2010-2016. It can be used to uncover trends in health outcomes and socioeconomic factors across different counties in the US over a six-year period.
The dataset contains a variety of information, including: statefips (a two-digit code that identifies the state); countyfips (a three-digit code that identifies the county); average household size; average annual count of cancer cases; average deaths per year; target death rate; median household income; population estimate for 2015; poverty percent; study per capita; binned income; and demographic information such as median age of the male and female population, percent married households, adults with no high school diploma, adults with a high school diploma, percentage with some college education, bachelor's degree holders among adults over 25 years old, employed persons 16 and over, unemployed persons 16 and over, private coverage available, private coverage available alone, temporary private coverage available, public coverage available, public coverage available alone, percentages of white, black, Asian, and other races, married households, and birth rate.
Using this dataset, you can build a multivariate ordinary least squares regression model to predict target_deathrate. You will also need to implement k-fold (k=10) cross-validation to best select your model parameters. Model diagnostics should be performed to assess linearity, serial independence, heteroskedasticity, normality, multicollinearity, and so on, while outliers, missing values, and categorical variables will also affect your model selection process. Finally, it is important to interpret the resulting models within their context, based upon all given factors (outliers, missing values, demographic changes, etc.), before arriving at a meaningful conclusion that may explain trends in health outcomes and socioeconomic factors found within this dataset.
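A minimal sketch of this workflow: an OLS fit for diagnostics plus a k=10 cross-validated RMSE (the file name and feature names are illustrative assumptions; substitute the actual columns):

~~~python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("cancer_reg.csv").dropna()  # assumed file name
features = ["medincome", "povertypercent", "pctprivatecoverage"]  # assumed names
X, y = df[features], df["target_deathrate"]

# statsmodels gives R-squared, coefficient tests, and residuals for diagnostics.
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())

# k=10 cross-validated RMSE with scikit-learn.
rmse = -cross_val_score(LinearRegression(), X, y,
                        scoring="neg_root_mean_squared_error", cv=10)
print("Mean CV RMSE:", rmse.mean())
~~~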
- Analysis of factors influencing target deathrates in different US counties.
- Prediction of the effects of varying poverty levels on health outcomes in different US counties.
- In-depth analysis of how various socio-economic factors (e.g., median income, educational attainment, etc.) contribute to overall public health outcomes in US counties
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors.
- You are free to:
  - Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt: remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit: provide a link to the license, and indicate if changes were made.
  - ShareAlike: you must distribute your contributions under the same license as the original.
  - ...
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains performance, attendance, and participation metrics of 300 students, intended for clustering, exploratory data analysis (EDA), and educational analytics. It can be used to explore relationships between quizzes, exams, GPA, attendance, lab sessions, and other academic indicators.
This dataset is ideal for unsupervised learning exercises, clustering students based on performance patterns, or for demonstrating educational analytics workflows.
Note: This is a small dataset (300 rows) and is not suitable for training large-scale supervised models.
File Name: student_performance.csv
Format: CSV (Comma-Separated Values)
Rows: 300
Columns: 16 features + optional identifier columns
Column Details:
| Column Name | Type | Description |
| ----------------------- | ------- | -------------------------------------------------------- |
| student_id | int64 | Unique student identifier |
| name | object | Student name (should be anonymized before use) |
| age | int64 | Age of the student (years) |
| gender | object | Gender of the student |
| quiz1_marks | float64 | Marks obtained in Quiz 1 (0–10) |
| quiz2_marks | float64 | Marks obtained in Quiz 2 (0–10) |
| quiz3_marks | float64 | Marks obtained in Quiz 3 (0–10) |
| total_assignments | int64 | Total number of assignments assigned |
| assignments_submitted | float64 | Number of assignments submitted (NaN in current dataset) |
| midterm_marks | float64 | Marks obtained in midterm exam (0–30) |
| final_marks | float64 | Marks obtained in final exam (0–50) |
| previous_gpa | float64 | GPA from previous semester (0–4 scale) |
| total_lectures | int64 | Total number of lectures scheduled |
| lectures_attended | int64 | Number of lectures attended |
| total_lab_sessions | int64 | Total lab sessions assigned |
| labs_attended | int64 | Number of lab sessions attended |
Suggested Usage:
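One suggested use is clustering students by performance patterns. A hedged scikit-learn sketch (the feature subset is an illustrative choice; assignments_submitted is skipped since it is NaN in the current dataset):

~~~python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("student_performance.csv")
features = ["quiz1_marks", "quiz2_marks", "quiz3_marks",
            "midterm_marks", "final_marks", "previous_gpa",
            "lectures_attended", "labs_attended"]
df = df.dropna(subset=features)  # defensive; these columns should be complete

# Standardize so marks on 0-10, 0-30, and 0-50 scales contribute equally.
X = StandardScaler().fit_transform(df[features])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
df["cluster"] = kmeans.labels_
print(df.groupby("cluster")[features].mean())
~~~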
License: CC BY 4.0 – Free to use, share, and adapt with proper attribution.
Citation: Muhammad Khubaib Ahmad, "Student Performance and Clustering Dataset", 2025, Kaggle. DOI: https://doi.org/10.34740/kaggle/dsv/13489035
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A simulated call centre dataset and notebook, designed to be used as a classroom / tutorial dataset for Business and Operations Analytics.
This notebook details the creation of simulated call centre logs over the course of one year. For this dataset we are imagining a business whose lines are open from 8:00am to 6:00pm, Monday to Friday. Four agents are on duty at any given time and each call takes an average of 5 minutes to resolve.
The call centre manager is required to meet a performance target: 90% of calls must be answered within 1 minute. Lately, the performance has slipped. As the data analytics expert, you have been brought in to analyze their performance and make recommendations to return the centre back to its target.
The dataset records timestamps for when a call was placed, when it was answered, and when the call was completed. The total waiting and service times are calculated, as well as a logical for whether the call was answered within the performance standard.
Discrete-Event Simulation allows us to model real calling behaviour with a few simple variables.
The simulations in this dataset are performed using the package simmer (Ucar et al., 2019). I encourage you to visit their website for complete details and fantastic tutorials on Discrete-Event Simulation.
Ucar I, Smeets B, Azcorra A (2019). “simmer: Discrete-Event Simulation for R.” Journal of Statistical Software, 90(2), 1–30.
For source code and simulation details, view the cross-posted GitHub notebook and Shiny app.
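For readers without R, an analogous minimal sketch in Python with simpy models the same system under stated assumptions: four agents and 5-minute average service times come from the description above, while the 0.6 calls-per-minute arrival rate is invented for illustration.

~~~python
import random

import simpy

WAIT_TARGET = 1.0  # target: answered within 1 minute
waits = []

def caller(env, agents):
    placed = env.now
    with agents.request() as req:
        yield req                                       # wait for a free agent
        waits.append(env.now - placed)
        yield env.timeout(random.expovariate(1 / 5.0))  # ~5 min to resolve

def arrivals(env, agents):
    while True:
        yield env.timeout(random.expovariate(0.6))      # ~0.6 calls per minute
        env.process(caller(env, agents))

env = simpy.Environment()
agents = simpy.Resource(env, capacity=4)                # four agents on duty
env.process(arrivals(env, agents))
env.run(until=600)                                      # one 8:00-18:00 day, in minutes

within = sum(w <= WAIT_TARGET for w in waits) / len(waits)
print(f"Answered within 1 minute: {within:.1%}")
~~~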
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains synthetically generated monthly fuel price data for Sri Lanka from January 2010 to August 2025, covering five major fuel types:
Prices are not real — they are created using a statistical simulation model that incorporates realistic market behaviors and macroeconomic effects such as:
The dataset is designed for educational, research, and data science practice purposes — ideal for time-series forecasting, trend visualization, and policy simulation exercises.
You can use this dataset for:
- change_reason and price changes.

Note: Missing values are included in certain months for some fuel types to simulate real-world data gaps. This allows testing of imputation and data cleaning techniques.
| Column | Description | Type / Values | Example |
|---|---|---|---|
| date | Month start date (YYYY-MM-DD) | Date | 2022-07-01 |
| fuel_type | Fuel type | Petrol_92, Petrol_95, Diesel_Auto, Diesel_Super, Kerosene | Petrol_92 |
| price_lkr_per_litre | Synthetic retail price per litre (LKR) | Integer, may have missing values | 470 |
| change_reason | Main driver of price change | global_oil, fx_rate, policy_revision, tax_adjustment, seasonal | policy_revision |
| notes | Additional context | String | Synthetic monthly price index; not real market data. |
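A hedged starting point for the imputation exercise mentioned in the note above (the file name is an assumption; column names follow the table):

~~~python
import pandas as pd

df = pd.read_csv("sri_lanka_fuel_prices.csv", parse_dates=["date"])
df = df.sort_values(["fuel_type", "date"])

# Monthly data: linear interpolation within each fuel type is a simple baseline.
df["price_filled"] = (
    df.groupby("fuel_type")["price_lkr_per_litre"]
      .transform(lambda s: s.interpolate(limit_direction="both"))
)
print(df[df["price_lkr_per_litre"].isna()].head())
~~~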
- price_lkr_per_litre using historical patterns.

💬 Feel free to discuss anything related to this dataset in the comments: suggestions, ideas, or ways to improve it are welcome!
This is the sample database from sqlservertutorial.net. This is a great dataset for learning SQL and practicing querying relational databases.
Database Diagram:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4146319%2Fc5838eb006bab3938ad94de02f58c6c1%2FSQL-Server-Sample-Database.png?generation=1692609884383007&alt=media
The sample database is copyrighted and cannot be used for commercial purposes. For example, it cannot be used for (but is not limited to) the following purposes:
- Selling
- Including in paid courses
What you get:
96,000+ matches with a detailed minute-by-minute history of each game plus player names (goals, yellow/red cards, penalties, VAR, missed penalties, etc.) in the INC factor. Season 2021-2022 included.
18 European leagues from 10 countries with their lead championship:
- premier-league: 7600 matches (seasons 2002-2022)
- laliga: 7220 matches (seasons 2003-2022)
- serie-a: 7150 matches (seasons 2003-2022)
- ligue-1: 6757 matches (seasons 2004-2022)
- championship: 6684 matches (seasons 2010-2022)
- league-one: 6440 matches (seasons 2010-2022)
- bundesliga: 5838 matches (seasons 2003-2022)
- league-two: 6015 matches (seasons 2011-2022)
- eredivisie: 5776 matches (seasons 2004-2022)
- laliga2: 5519 matches (seasons 2010-2022)
- serie-b: 5286 matches (seasons 2010-2022)
- ligue-2: 4470 matches (seasons 2010-2022)
- super-lig: 3504 matches (seasons 2010-2022)
- jupiler-league: 3756 matches (seasons 2010-2022)
- fortuna-1-liga: 3687 matches (seasons 2010-2022)
- 2-bundesliga: 3503 matches (seasons 2010-2022)
- liga-portugal: 3414 matches (seasons 2010-2022)
- pko-bp-ekstraklasa: 3338 matches (seasons 2010-2022)
Betting odds, including winning betting odds. Statistics and detailed match events (goal types, possession, corners, crosses, fouls, cards, etc.) for 96,000+ matches.
You can easily find data about football matches, but they are usually scattered across different websites, and in my opinion those data lack well-shaped game events. Therefore the most useful part of this dataset is the INC factor, which is in fact a register of game events minute-by-minute (goals, cards, VARs, missed penalties, etc.) collected in a Python list. Example, Swansea-Reading:
"INC": [
"08' Yellow_Away - Griffin A.",
"12' Yellow_Away - Khizanishvili Z.",
"12' Yellow_Home - Borini F.",
"21' Goal_Home - Penalty Sinclair S.(Penalty )",
"22' Goal_Home - Sinclair S.(Dobbie S.)",
"39' Yellow_Away - McAnuff J.",
"40' Goal_Home - Dobbie S.",
"46' Red_Card_Away - Tabb J.",
"49' Own_Away - Allen J.()",
"54' Yellow_Home - Allen J.",
"57' Goal_Away - Mills M.(McAnuff J.)",
"80' Goal_Home - Sinclair S. (Penalty)",
"82' Yellow_Home - Gower M."
],
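Since the INC entries share a simple "minute' Event - detail" shape (inferred from the sample above), a small regular expression can split them into fields. A hedged sketch:

~~~python
import re

inc = [
    "08' Yellow_Away - Griffin A.",
    "21' Goal_Home - Penalty Sinclair S.(Penalty )",
    "46' Red_Card_Away - Tabb J.",
]

# Capture the minute, the event tag, and the free-text detail (player, assist).
pattern = re.compile(r"^(\d+)' (\w+) - (.+)$")
for entry in inc:
    match = pattern.match(entry)
    if match:
        minute, event, detail = int(match.group(1)), match.group(2), match.group(3)
        print(minute, event, detail)
~~~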
These data are scraped from one of the livescore web page providers. I own a program written in Python which can scrape data from any league around the world (although it takes time, and the program itself needs constant updating as the providers change their source code).
Locally my dataset is larger because it contains 100+ factors, i.e. it contains info about the previous games with all info about those games and more additional info. I shortened the dataset uploaded on Kaggle to make it simpler and more understandable.
I must insist that you do not make any commercial use of the data. I provide this dataset for your non-commercial use only.
sebastian.gebala@gmail.com
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains every player drafted in the NHL Draft from 1963 to 2022.
The data was collected from Sports Reference, then cleaned for data analysis.
Tabular data includes:
- year: Year of draft
- overall_pick: Overall pick player was drafted
- team: Team player drafted to
- player: Player drafted
- nationality: Nationality of player drafted
- position: Player position
- age: Player age
- to_year: Year draft pick played to
- amateur_team: Amateur team drafted from
- games_played: Total games played by player (non-goalie)
- goals: Total goals
- assists: Total assists
- points: Total points
- plus_minus: Plus minus of player
- penalties_minutes: Penalties in minutes
- goalie_games_played: Goalie games played
- goalie_wins
- goalie_losses
- goalie_ties_overtime: Ties plus overtime/shootout losses
- save_percentage
- goals_against_average
- point_shares