License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset has been uploaded to Kaggle to accompany the 365 Data Science • Practice Exams: SQL curriculum, a set of free resources designed to help test and elevate data science skills. It is a synthetic, relational collection of data structured to simulate common employee and organizational scenarios, ideal for practicing SQL queries and data analysis skills in a People Analytics context.
The dataset contains the following tables:
departments.csv: List of all company departments.
dept_emp.csv: Historical and current assignments of employees to departments.
dept_manager.csv: Historical and current assignments of employees as department managers.
employees.csv: Core employee demographic information.
employees.db: A SQLite database containing all the relational tables from the CSV files.
salaries.csv: Historical salary records for employees.
titles.csv: Historical job titles held by employees.
The dataset supports both general data analytics and time series analysis applications.
A practical application is presented in the 🎓 365DS Practice Exams • SQL notebook, which covers in detail the answers to the questions of SQL Practice Exams 1, 2, and 3 on the 365DS platform, especially illustrating the usage and value of SQL procedures and functions.
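As a minimal, hedged illustration of querying the bundled SQLite database (separate from the notebook above; column names such as emp_no, dept_no, and salary are assumptions based on the classic employees schema this dataset derives from):

~~~python
import sqlite3

# Average salary per department; table/column names are assumptions
# drawn from the CSV file names and the classic employees schema.
conn = sqlite3.connect("employees.db")
query = """
SELECT d.dept_name,
       ROUND(AVG(s.salary), 2) AS avg_salary
FROM salaries AS s
JOIN dept_emp AS de ON de.emp_no = s.emp_no
JOIN departments AS d ON d.dept_no = de.dept_no
GROUP BY d.dept_name
ORDER BY avg_salary DESC;
"""
for dept_name, avg_salary in conn.execute(query):
    print(dept_name, avg_salary)
conn.close()
~~~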
This dataset has a rich lineage, originating from academic research and evolving through various formats to its current relational structure:
The foundational dataset was authored by Prof. Dr. Fusheng Wang 🔗 (then a PhD student at the University of California, Los Angeles - UCLA) and his advisor, Prof. Dr. Carlo Zaniolo 🔗 (UCLA). This work is primarily described in their paper:
It was originally distributed as an .xml file. Giuseppe Maxia (known as @datacharmer on GitHub🔗 and LinkedIn🔗, as well as here on Kaggle) converted it into its relational form and subsequently distributed it as a .sql file, making it accessible for relational database use.
This .sql version was then loaded to Kaggle as the « Employees Dataset » by Mirza Huzaifa🔗 on February 5th, 2023.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.
dirty_cafe_sales.csv

| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich |
| Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN |
| Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00 |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00 |
| Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway |
| Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01 |
Missing Values: Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values: Some entries contain "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency: The dataset includes the following menu items with their respective prices:
| Item | Price($) |
|---|---|
| Coffee | 2 |
| Tea | 1.5 |
| Sandwich | 4 |
| Salad | 5 |
| Cake | 3 |
| Cookie | 1 |
| Smoothie | 4 |
| Juice | 3 |
This dataset is suitable for:
- Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
- Exploring EDA techniques like visualizations and summary statistics.
- Performing feature engineering for machine learning workflows.
To clean this dataset, consider the following steps:
1. Handle Missing Values:
   - Fill missing numeric values with the median or mean.
   - Replace missing categorical values with the mode or "Unknown."
2. Handle Invalid Values:
   - Replace "ERROR" and "UNKNOWN" with NaN or appropriate values.
3. Date Consistency:
   - Parse Transaction Date and handle missing or incorrect dates.
4. Feature Engineering:
   - Create new columns, such as Day of the Week or Transaction Month, for further analysis.
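A hedged pandas sketch of steps 1 through 4 (the file name comes from the table above; the exact set of invalid markers in the data may differ):

~~~python
import numpy as np
import pandas as pd

df = pd.read_csv("dirty_cafe_sales.csv")

# Step 2 first: normalize invalid markers to NaN so they count as missing.
df = df.replace(["ERROR", "UNKNOWN", "None", ""], np.nan)

# Step 1: numeric columns get the median; categorical columns get "Unknown".
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].median())
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].fillna("Unknown")

# Step 3: parse dates; unparseable entries become NaT.
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

# Step 4: simple date-derived features.
df["Day of the Week"] = df["Transaction Date"].dt.day_name()
df["Transaction Month"] = df["Transaction Date"].dt.month
~~~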
This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.
Hi. This is my data analysis project and also my first try at using R in my work. It is the capstone project for the Google Data Analytics Certificate course offered on Coursera (https://www.coursera.org/professional-certificates/google-data-analytics). It is an operational data analysis of data from a health monitoring device. For the detailed background story, please check the PDF file (Case 02.pdf) for reference.
In this case study, I use personal health tracker data from Fitbit to evaluate how the health tracker device is used, and then determine whether there are any trends or patterns.
My data analysis will focus on two areas: exercise activity and sleeping habits. The exercise activity part studies the relationship between activity type and calories consumed, while the sleeping habit part identifies patterns in how users sleep. In this analysis, I will also try to use some linear regression models, so that the data can be explained in a quantitative way and predictions become easier.
I understand that I am new to data analysis and my skills and code are very beginner level. But I am working hard to learn more in both R and the data science field. If you have any ideas or feedback, please feel free to comment.
Stanley Cheng 2021-10-07
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
Market basket analysis with the Apriori algorithm
The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a dataset of a retailer; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow in the industry and provide customers with itemset suggestions, so we will be able to increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.
Association Rule mining is most used when you are planning to find associations between different objects in a set. It works when you are planning to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule bought Computer Mouse => bought Mouse Mat:
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support / P(Computer Mouse) = 0.08/0.10 = 0.8
- lift = confidence / P(Mouse Mat) = 0.8/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Number of Attributes: 7
First, we need to load the required libraries.
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
Next, we will clean our data frame and remove missing values.
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
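The original walkthrough continues in R. As an analogous, hedged sketch in Python, mlxtend expresses the same pipeline of encoding transactions and mining rules (the toy transactions and thresholds are illustrative, not the real data):

~~~python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Toy stand-in for the invoice-grouped items of the real dataset.
transactions = [
    ["Computer Mouse", "Mouse Mat"],
    ["Computer Mouse", "Keyboard"],
    ["Mouse Mat", "Keyboard"],
]

# One-hot encode transactions into a boolean basket matrix.
te = TransactionEncoder()
basket = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Mine frequent itemsets, then derive association rules from them.
frequent = apriori(basket, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
~~~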
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
What is the Breast Cancer Dataset?
Breast cancer is the most common cancer amongst women in the world. It accounts for 25% of all cancer cases and affected over 2.1 Million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or felt as lumps in the breast area.
How to use this dataset
The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous). We ask you to complete the analysis of classifying these tumors using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) Dataset.
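A minimal sketch of the suggested SVM classification, using scikit-learn's built-in copy of the Breast Cancer Wisconsin (Diagnostic) data (the Kaggle CSV could be loaded with pandas instead; the hyperparameters here are illustrative):

~~~python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Feature scaling matters for SVMs; an RBF kernel is a common default.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
~~~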
Acknowledgments
When we use this dataset in our research, we credit the authors as follows:
License: CC BY 4.0.
This data set is taken from https://data.world/health/breast-cancer-wisconsin by the Donor: Nick Street and the Source: UCI - Machine Learning Repository.
The main idea behind uploading this dataset is to practice data analysis with my students, as I work at a college and want my students to apply what we study to a big dataset. It may not be up to date, and I mention the collection years, but it is a good resource of data for practice.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a synthetic yet realistic E-commerce retail dataset generated programmatically using Python (Faker + NumPy + Pandas).
It is designed to closely mimic real-world online shopping behavior, user patterns, product interactions, seasonal trends, and marketplace events.
Machine Learning & Deep Learning
Recommender Systems
Customer Segmentation
Sales Forecasting
A/B Testing
E-commerce Behaviour Analysis
Data Cleaning / Feature Engineering Practice
SQL practice
The dataset contains 6 CSV files:

| File | Rows | Description |
|---|---|---|
| users.csv | ~10,000 | User profiles, demographics & signup info |
| products.csv | ~2,000 | Product catalog with rating and pricing |
| orders.csv | ~20,000 | Order-level transactions |
| order_items.csv | ~60,000 | Items purchased per order |
| reviews.csv | ~15,000 | Customer-written product reviews |
| events.csv | ~80,000 | User event logs: view, cart, wishlist, purchase |
1. Users (users.csv)

| Column | Description |
|---|---|
| user_id | Unique user identifier |
| name | Full customer name |
| email | Email (synthetic, no real emails) |
| gender | Male / Female / Other |
| city | City of residence |
| signup_date | Account creation date |

2. Products (products.csv)

| Column | Description |
|---|---|
| product_id | Unique product identifier |
| product_name | Product title |
| category | Electronics, Clothing, Beauty, Home, Sports, etc. |
| price | Actual selling price |
| rating | Average product rating |

3. Orders (orders.csv)

| Column | Description |
|---|---|
| order_id | Unique order identifier |
| user_id | User who placed the order |
| order_date | Timestamp of the order |
| order_status | Completed / Cancelled / Returned |
| total_amount | Total order value |

4. Order Items (order_items.csv)

| Column | Description |
|---|---|
| order_item_id | Unique identifier |
| order_id | Associated order |
| product_id | Purchased product |
| quantity | Quantity purchased |
| item_price | Price per unit |

5. Reviews (reviews.csv)

| Column | Description |
|---|---|
| review_id | Unique review identifier |
| user_id | User who submitted review |
| product_id | Reviewed product |
| rating | 1–5 star rating |
| review_text | Short synthetic review |
| review_date | Submission date |

6. Events (events.csv)

| Column | Description |
|---|---|
| event_id | Unique event identifier |
| user_id | User performing event |
| product_id | Viewed/added/purchased product |
| event_type | view/cart/wishlist/purchase |
| event_timestamp | Timestamp of event |
Customer churn prediction
Review sentiment analysis (NLP)
Recommendation engines
Price optimization models
Demand forecasting (Time-series)
Market basket analysis
RFM segmentation
Cohort analysis
Funnel conversion tracking
A/B testing simulations
Joins
Window functions
Aggregations
CTE-based funnels
Complex queries
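For example, the events table supports a CTE-based funnel. A hedged sketch that loads the CSV into SQLite (the event_type values follow the schema above):

~~~python
import pandas as pd
import sqlite3

conn = sqlite3.connect(":memory:")
pd.read_csv("events.csv").to_sql("events", conn, index=False)

# view -> cart -> purchase funnel, computed per user via a CTE.
query = """
WITH funnel AS (
    SELECT user_id,
           MAX(event_type = 'view')     AS viewed,
           MAX(event_type = 'cart')     AS carted,
           MAX(event_type = 'purchase') AS purchased
    FROM events
    GROUP BY user_id
)
SELECT SUM(viewed) AS viewers,
       SUM(carted) AS carters,
       SUM(purchased) AS purchasers
FROM funnel;
"""
print(pd.read_sql(query, conn))
~~~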
Faker for realistic user and review generation
NumPy for probability-based event modeling
Pandas for data processing
demand variation
user behavior simulation
return/cancel probabilities
seasonal order timestamp distribution
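An illustrative flavor of this generation approach (not the author's actual script; the probabilities and sizes are made up):

~~~python
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
rng = np.random.default_rng(7)

# Faker-style user profiles.
users = pd.DataFrame({
    "user_id": range(1, 101),
    "name": [fake.name() for _ in range(100)],
    "city": [fake.city() for _ in range(100)],
})

# NumPy probability-based event modeling: views dominate, purchases are rare.
events = pd.DataFrame({
    "user_id": rng.integers(1, 101, size=1000),
    "event_type": rng.choice(["view", "cart", "wishlist", "purchase"],
                             size=1000, p=[0.6, 0.2, 0.1, 0.1]),
})
print(events["event_type"].value_counts())
~~~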
The dataset does not include any real personal data.
Everything is generated synthetically.
This dataset is released under CC BY 4.0 — free to use for:
Research
Education
Commercial projects
Kaggle competitions
Machine learning pipelines
Just provide attribution.
Upvote the dataset
Leave a comment
Share your notebooks using it
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains detailed information on all Premier League matches played between the 2021 and 2025 seasons. It includes match dates, times, venues, results, goals scored (gf), goals against (ga), expected goals (xg), possession percentages, attendance figures, team formations, referees, and other relevant statistics. This data can be used for analysis, modeling predictions, or exploring trends in Premier League football.
| Column Name | Description |
|---|---|
| date | The date of the match (format: MM/DD/YYYY) |
| time | The time of the match (in 24-hour format) |
| comp | Competition name (e.g., Premier League) |
| round | Match round or week number |
| day | Day of the week when the match was played |
| venue | Venue where the match took place |
| result | Result of the match (W for Win, D for Draw, L for Loss) |
| gf | Goals For - number of goals scored by the home team |
| ga | Goals Against - number of goals conceded by the home team |
| opponent | Name of the opposing team |
| xg | Expected Goals for the home team |
| xga | Expected Goals Against for the home team |
| poss | Possession percentage |
| attendance | Number of spectators attending the match |
| captain | Captain's name for the home team |
| formation | Formation used by the home team |
| opp formation | Formation used by the opponent |
| referee | Referee officiating the match |
| match report | Link or reference to a detailed match report |
| notes | Additional notes regarding specific matches |
| sh | Total shots taken by the home team |
| sot | Shots on target by the home team |
| dist | Average distance of shots (in meters) |
| fk | Number of free kicks awarded to the home team |
| pk | Number of penalties awarded to the home team |
| pkatt | Number of penalties attempted by the home team |
| team | Name of the home team |
| season | Season during which matches were played |
The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.
This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.
This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.
Cover Photo by: Freepik
Thumbnail by: Clothing icons created by Flat Icons - Flaticon
This dataset is a practical SQL case study designed for learners who are looking to enhance their SQL skills in analyzing sales, products, and marketing data. It contains several SQL queries related to a simulated business database for product sales, marketing expenses, and location data. The database consists of three main tables: Fact, Product, and Location.
Objective of the Case Study: The purpose of this case study is to provide learners with a variety of practical SQL exercises that involve real-world business problems. The queries explore topics such as:
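As a flavor of such exercises, a hypothetical query against the described Fact/Product/Location star schema (all column names here are illustrative assumptions, not the actual schema):

~~~python
import sqlite3

conn = sqlite3.connect("sales_case_study.db")  # assumed file name
query = """
SELECT p.product_name,
       l.region,
       SUM(f.sales_amount)   AS total_sales,
       SUM(f.marketing_cost) AS total_marketing
FROM Fact AS f
JOIN Product  AS p ON p.product_id  = f.product_id
JOIN Location AS l ON l.location_id = f.location_id
GROUP BY p.product_name, l.region
ORDER BY total_sales DESC;
"""
for row in conn.execute(query):
    print(row)
~~~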
The description is a more detailed explanation of the dataset's content, source, and potential use cases. It helps users understand the dataset's relevance and usefulness for their projects. Here's an example description for the NBA players performance dataset: Description Example: "This dataset contains comprehensive performance statistics for NBA players from the 2020-2021 season. It includes player-level data such as points scored, rebounds, assists, field goal percentage, free throw percentage, and more. The data was collected from official NBA records and other reputable sources.
The dataset can be used for various data analysis and machine learning tasks related to NBA player performance. Analysts and researchers can explore player trends, compare individual performances, identify standout players, and investigate correlations between different performance metrics.
Whether you're an NBA enthusiast, a data scientist, or a basketball coach, this dataset provides valuable insights into the statistical aspects of player performance in the 2020-2021 NBA season. It is ideal for data-driven research, building predictive models, and gaining a deeper understanding of player contributions to their teams."
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The E-Commerce Customer Behavior Dataset is a synthetic dataset designed to capture the full spectrum of customer interactions with an online retail platform. Created by Gretel AI for educational and research purposes, it provides a comprehensive view of how customers browse, purchase, and review products. The dataset is ideal for data science practice, machine learning modeling, and exploratory analytics.
Structured list of products purchased, including:
Allows analysis of repeat purchases, product popularity, and category trends.
| Feature | Range / Distribution | Notes |
|---|---|---|
| Age | 24–65 | Mean: 40, Std: 11 |
| Gender | Female 52%, Male 36%, Other 12% | Categorical |
| Location | Most common: City D (24%), City E (12%), Other (64%) | Regional trends |
| Annual Income | $40,000–$100,000 | Mean: $65,800, Std: $16,900 |
| Time on Site | 32.5–486.3 mins | Mean: 233, Std: 109 |
[
{"Date": "2022-03-05", "Category": "Clothing", "Price": 34.99},
{"Date": "2022-02-12", "Category": "Electronics", "Price": 129.99},
{"Date": "2022-01-20", "Category": "Home & Garden", "Price": 29.99}
]
[
{"Timestamp": "2022-03-10T14:30:00Z"},
{"Timestamp": "2022-03-11T09:45:00Z"},
{"Timestamp": "2022-03-12T16:20:00Z"}
]
{
"Review Text": "Excellent product, highly recommend!",
"Rating": 5
}
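The nested records shown above (purchase history, session timestamps, review objects) can be flattened for tabular analysis. A hedged sketch, assuming such a field arrives as a JSON string in a column named "Purchase History" (an assumption about the export format):

~~~python
import json

import pandas as pd

# Hypothetical row standing in for one record of the dataset.
row = {
    "Customer ID": 1,
    "Purchase History": json.dumps([
        {"Date": "2022-03-05", "Category": "Clothing", "Price": 34.99},
        {"Date": "2022-02-12", "Category": "Electronics", "Price": 129.99},
    ]),
}

# One row per purchase: Date, Category, Price, plus the owning customer.
purchases = pd.json_normalize(json.loads(row["Purchase History"]))
purchases["Customer ID"] = row["Customer ID"]
print(purchases)
~~~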
This dataset was synthetically generated using machine learning techniques to simulate realistic customer behavior:
Pattern Recognition: identifying trends and correlations observed in real-world e-commerce datasets.
Synthetic Data Generation: producing data points for all features while preserving realistic relationships.
Controlled Variation: introducing diversity to reflect a wide range of customer behaviors while maintaining logical consistency.
CC BY 4.0 (Attribution 4.0 International): free to use for educational and research purposes with attribution.
License: U.S. Government Works, https://www.usa.gov/government-works/
New York City (NYC) Taxi & Limousine Commission (TLC) keeps data from all its cabs, and it is freely available to download from its official website. You can access it here. Now, the TLC primarily keeps and manages data for 4 different types of vehicles:
- Yellow Taxi (Yellow Medallion Taxicabs): These are the famous NYC yellow taxis that provide transportation exclusively through street hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.
- Green Taxi (Street Hail Livery): The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.
- For-Hire Vehicles (FHVs): FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.
| Field Name | Description |
|---|---|
| VendorID | A code indicating the TPEP provider that provided the record. |
| tpep_pickup_datetime | The date and time when the meter was engaged. |
| tpep_dropoff_datetime | The date and time when the meter was disengaged. |
| Passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
| Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
| Pickup_longitude | Longitude where the meter was engaged. |
| Pickup_latitude | Latitude where the meter was engaged. |
| RateCodeID | The final rate code in effect at the end of the trip. |
| Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip. |
| Dropoff_longitude | Longitude where the meter was disengaged. |
| Dropoff_latitude | Latitude where the meter was disengaged. |
| Payment_type | A numeric code signifying how the passenger paid for the trip. |
| Fare_amount | The time-and-distance fare calculated by the meter. |
| Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
| MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| Improvement_surcharge | $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
By Data Exercises [source]
This dataset contains a wealth of health-related information and socio-economic data aggregated from multiple sources such as the American Community Survey, clinicaltrials.gov, and cancer.gov, covering a variety of US counties. Your task is to use this collection of data to build an Ordinary Least Squares (OLS) regression model that predicts the target death rate in each county. The model should incorporate variables related to population size, health insurance coverage, educational attainment levels, median incomes, and poverty rates. Additionally, you will need to assess linearity among your model parameters; measure serial independence among errors; test for heteroskedasticity; evaluate normality of the residual distribution; identify any outliers or missing values and determine how categorical variables are handled; compare models through k=10 cross-validation within linear regressions; and assess multicollinearity among model parameters. Examine your results using statistical measures such as R-squared values and Root Mean Square Error (RMSE), and interpret the implications your analysis uncovers about health outcomes and their demographic correlates across geographic boundaries throughout the United States.
This dataset provides data on health outcomes, demographics, and socio-economic factors for various US counties from 2010-2016. It can be used to uncover trends in health outcomes and socioeconomic factors across different counties in the US over a six-year period.
The dataset contains a variety of information, including: statefips (a two-digit code that identifies the state); countyfips (a three-digit code that identifies the county); average household size; average annual count of cancer cases; average deaths per year; target death rate; median household income; population estimate for 2015; poverty percent; study per capita; binned income; and demographic information such as median age of the male and female population, percent married households, adults with no high school diploma, adults with a high school diploma, percentage with some college education, bachelor's degree holders among adults over 25 years old, employed persons 16 and over, unemployed persons 16 and over, private coverage available, private coverage available alone, temporary private coverage available, public coverage available, public coverage available alone, percentages of white, black, Asian, and other races, married households, and birth rate.
Using this dataset, you can build a multivariate ordinary least squares regression model to predict target_deathrate. You will also need to implement k-fold (k=10) cross-validation to best select your model parameters. Model diagnostics should be performed to assess linearity, serial independence, heteroskedasticity, normality, multicollinearity, and so on, while outliers, missing values, and categorical variables will also affect your model selection process. Finally, it is important to interpret the resulting models within their context, based upon all given factors (outliers, missing values, demographic changes, etc.), before arriving at a meaningful conclusion that may explain trends in health outcomes and socioeconomic factors found within this dataset.
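A minimal sketch of this workflow: an OLS fit for diagnostics plus a k=10 cross-validated RMSE (the file name and feature names are illustrative assumptions; substitute the actual columns):

~~~python
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("cancer_reg.csv").dropna()  # assumed file name
features = ["medincome", "povertypercent", "pctprivatecoverage"]  # assumed names
X, y = df[features], df["target_deathrate"]

# statsmodels gives R-squared, coefficient tests, and residuals for diagnostics.
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())

# k=10 cross-validated RMSE with scikit-learn.
rmse = -cross_val_score(LinearRegression(), X, y,
                        scoring="neg_root_mean_squared_error", cv=10)
print("Mean CV RMSE:", rmse.mean())
~~~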
- Analysis of factors influencing target deathrates in different US counties.
- Prediction of the effects of varying poverty levels on health outcomes in different US counties.
- In-depth analysis of how various socio-economic factors (e.g., median income, educational attainment, etc.) contribute to overall public health outcomes in US counties
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors.
- You are free to:
  - Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt: remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit: provide a link to the license, and indicate if changes were made.
  - ShareAlike: you must distribute your contributions under the same license as the original.
  - ...
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains performance, attendance, and participation metrics of 300 students, intended for clustering, exploratory data analysis (EDA), and educational analytics. It can be used to explore relationships between quizzes, exams, GPA, attendance, lab sessions, and other academic indicators.
This dataset is ideal for unsupervised learning exercises, clustering students based on performance patterns, or for demonstrating educational analytics workflows.
Note: This is a small dataset (300 rows) and is not suitable for training large-scale supervised models.
File Name: student_performance.csv
Format: CSV (Comma-Separated Values)
Rows: 300
Columns: 16 features + optional identifier columns
Column Details:
| Column Name | Type | Description |
| ----------------------- | ------- | -------------------------------------------------------- |
| student_id | int64 | Unique student identifier |
| name | object | Student name (should be anonymized before use) |
| age | int64 | Age of the student (years) |
| gender | object | Gender of the student |
| quiz1_marks | float64 | Marks obtained in Quiz 1 (0–10) |
| quiz2_marks | float64 | Marks obtained in Quiz 2 (0–10) |
| quiz3_marks | float64 | Marks obtained in Quiz 3 (0–10) |
| total_assignments | int64 | Total number of assignments assigned |
| assignments_submitted | float64 | Number of assignments submitted (NaN in current dataset) |
| midterm_marks | float64 | Marks obtained in midterm exam (0–30) |
| final_marks | float64 | Marks obtained in final exam (0–50) |
| previous_gpa | float64 | GPA from previous semester (0–4 scale) |
| total_lectures | int64 | Total number of lectures scheduled |
| lectures_attended | int64 | Number of lectures attended |
| total_lab_sessions | int64 | Total lab sessions assigned |
| labs_attended | int64 | Number of lab sessions attended |
Suggested Usage:
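One suggested use is clustering students by performance patterns. A hedged scikit-learn sketch (the feature subset is an illustrative choice; assignments_submitted is skipped since it is NaN in the current dataset):

~~~python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("student_performance.csv")
features = ["quiz1_marks", "quiz2_marks", "quiz3_marks",
            "midterm_marks", "final_marks", "previous_gpa",
            "lectures_attended", "labs_attended"]
df = df.dropna(subset=features)  # defensive; these columns should be complete

# Standardize so marks on 0-10, 0-30, and 0-50 scales contribute equally.
X = StandardScaler().fit_transform(df[features])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
df["cluster"] = kmeans.labels_
print(df.groupby("cluster")[features].mean())
~~~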
License: CC BY 4.0 – Free to use, share, and adapt with proper attribution.
Citation: Muhammad Khubaib Ahmad, "Student Performance and Clustering Dataset", 2025, Kaggle. DOI: https://doi.org/10.34740/kaggle/dsv/13489035
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A simulated call centre dataset and notebook, designed to be used as a classroom / tutorial dataset for Business and Operations Analytics.
This notebook details the creation of simulated call centre logs over the course of one year. For this dataset we are imagining a business whose lines are open from 8:00am to 6:00pm, Monday to Friday. Four agents are on duty at any given time and each call takes an average of 5 minutes to resolve.
The call centre manager is required to meet a performance target: 90% of calls must be answered within 1 minute. Lately, the performance has slipped. As the data analytics expert, you have been brought in to analyze their performance and make recommendations to return the centre back to its target.
The dataset records timestamps for when a call was placed, when it was answered, and when the call was completed. The total waiting and service times are calculated, as well as a logical for whether the call was answered within the performance standard.
Discrete-Event Simulation allows us to model real calling behaviour with a few simple variables.
The simulations in this dataset are performed using the package simmer (Ucar et al., 2019). I encourage you to visit their website for complete details and fantastic tutorials on Discrete-Event Simulation.
Ucar I, Smeets B, Azcorra A (2019). “simmer: Discrete-Event Simulation for R.” Journal of Statistical Software, 90(2), 1–30.
For source code and simulation details, view the cross-posted GitHub notebook and Shiny app.
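For readers without R, an analogous minimal sketch in Python with simpy models the same system under stated assumptions: four agents and 5-minute average service times come from the description above, while the 0.6 calls-per-minute arrival rate is invented for illustration.

~~~python
import random

import simpy

WAIT_TARGET = 1.0  # target: answered within 1 minute
waits = []

def caller(env, agents):
    placed = env.now
    with agents.request() as req:
        yield req                                       # wait for a free agent
        waits.append(env.now - placed)
        yield env.timeout(random.expovariate(1 / 5.0))  # ~5 min to resolve

def arrivals(env, agents):
    while True:
        yield env.timeout(random.expovariate(0.6))      # ~0.6 calls per minute
        env.process(caller(env, agents))

env = simpy.Environment()
agents = simpy.Resource(env, capacity=4)                # four agents on duty
env.process(arrivals(env, agents))
env.run(until=600)                                      # one 8:00-18:00 day, in minutes

within = sum(w <= WAIT_TARGET for w in waits) / len(waits)
print(f"Answered within 1 minute: {within:.1%}")
~~~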
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains synthetically generated monthly fuel price data for Sri Lanka from January 2010 to August 2025, covering five major fuel types:
Prices are not real — they are created using a statistical simulation model that incorporates realistic market behaviors and macroeconomic effects such as:
The dataset is designed for educational, research, and data science practice purposes — ideal for time-series forecasting, trend visualization, and policy simulation exercises.
You can use this dataset for:
- change_reason and price changes.

Note: Missing values are included in certain months for some fuel types to simulate real-world data gaps. This allows testing of imputation and data cleaning techniques.
| Column | Description | Type / Values | Example |
|---|---|---|---|
| date | Month start date (YYYY-MM-DD) | Date | 2022-07-01 |
| fuel_type | Fuel type | Petrol_92, Petrol_95, Diesel_Auto, Diesel_Super, Kerosene | Petrol_92 |
| price_lkr_per_litre | Synthetic retail price per litre (LKR) | Integer, may have missing values | 470 |
| change_reason | Main driver of price change | global_oil, fx_rate, policy_revision, tax_adjustment, seasonal | policy_revision |
| notes | Additional context | String | Synthetic monthly price index; not real market data. |
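A hedged starting point for the imputation exercise mentioned in the note above (the file name is an assumption; column names follow the table):

~~~python
import pandas as pd

df = pd.read_csv("sri_lanka_fuel_prices.csv", parse_dates=["date"])
df = df.sort_values(["fuel_type", "date"])

# Monthly data: linear interpolation within each fuel type is a simple baseline.
df["price_filled"] = (
    df.groupby("fuel_type")["price_lkr_per_litre"]
      .transform(lambda s: s.interpolate(limit_direction="both"))
)
print(df[df["price_lkr_per_litre"].isna()].head())
~~~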
- price_lkr_per_litre using historical patterns.

💬 Feel free to discuss anything related to this dataset in the comments: suggestions, ideas, or ways to improve it are welcome!
This is the sample database from sqlservertutorial.net. This is a great dataset for learning SQL and practicing querying relational databases.
Database Diagram:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4146319%2Fc5838eb006bab3938ad94de02f58c6c1%2FSQL-Server-Sample-Database.png?generation=1692609884383007&alt=media
The sample database is copyrighted and cannot be used for commercial purposes. For example, it cannot be used for (but is not limited to) the following purposes:
- Selling
- Including in paid courses
What you get:
96,000+ matches with a detailed minute-by-minute history of each game plus player names (goals, yellow/red cards, penalties, VAR, missed penalties, etc.) in the INC factor. Season 2021-2022 included.
18 European leagues from 10 countries with their lead championship:
- premier-league: 7600 matches (seasons 2002-2022)
- laliga: 7220 matches (seasons 2003-2022)
- serie-a: 7150 matches (seasons 2003-2022)
- ligue-1: 6757 matches (seasons 2004-2022)
- championship: 6684 matches (seasons 2010-2022)
- league-one: 6440 matches (seasons 2010-2022)
- bundesliga: 5838 matches (seasons 2003-2022)
- league-two: 6015 matches (seasons 2011-2022)
- eredivisie: 5776 matches (seasons 2004-2022)
- laliga2: 5519 matches (seasons 2010-2022)
- serie-b: 5286 matches (seasons 2010-2022)
- ligue-2: 4470 matches (seasons 2010-2022)
- super-lig: 3504 matches (seasons 2010-2022)
- jupiler-league: 3756 matches (seasons 2010-2022)
- fortuna-1-liga: 3687 matches (seasons 2010-2022)
- 2-bundesliga: 3503 matches (seasons 2010-2022)
- liga-portugal: 3414 matches (seasons 2010-2022)
- pko-bp-ekstraklasa: 3338 matches (seasons 2010-2022)
Betting odds, including winning betting odds. Statistics and detailed match events (goal types, possession, corners, crosses, fouls, cards, etc.) for 96,000+ matches.
You can easily find data about football matches, but they are usually scattered across different websites, and in my opinion those data lack well-shaped game events. Therefore the most useful part of this dataset is the INC factor, which is in fact a register of game events minute-by-minute (goals, cards, VARs, missed penalties, etc.) collected in a Python list. Example, Swansea-Reading:
"INC": [
"08' Yellow_Away - Griffin A.",
"12' Yellow_Away - Khizanishvili Z.",
"12' Yellow_Home - Borini F.",
"21' Goal_Home - Penalty Sinclair S.(Penalty )",
"22' Goal_Home - Sinclair S.(Dobbie S.)",
"39' Yellow_Away - McAnuff J.",
"40' Goal_Home - Dobbie S.",
"46' Red_Card_Away - Tabb J.",
"49' Own_Away - Allen J.()",
"54' Yellow_Home - Allen J.",
"57' Goal_Away - Mills M.(McAnuff J.)",
"80' Goal_Home - Sinclair S. (Penalty)",
"82' Yellow_Home - Gower M."
],
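Since the INC entries share a simple "minute' Event - detail" shape (inferred from the sample above), a small regular expression can split them into fields. A hedged sketch:

~~~python
import re

inc = [
    "08' Yellow_Away - Griffin A.",
    "21' Goal_Home - Penalty Sinclair S.(Penalty )",
    "46' Red_Card_Away - Tabb J.",
]

# Capture the minute, the event tag, and the free-text detail (player, assist).
pattern = re.compile(r"^(\d+)' (\w+) - (.+)$")
for entry in inc:
    match = pattern.match(entry)
    if match:
        minute, event, detail = int(match.group(1)), match.group(2), match.group(3)
        print(minute, event, detail)
~~~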
These data are scraped from one of the livescore web page providers. I own a program written in Python which can scrape data from any league around the world (although it takes time, and the program itself needs constant updating as the providers change their source code).
Locally my dataset is larger because it contains 100+ factors, i.e. it contains info about the previous games with all info about those games and more additional info. I shortened the dataset uploaded on Kaggle to make it simpler and more understandable.
I must insist that you do not make any commercial use of the data. I provide this dataset for your non-commercial use only.
sebastian.gebala@gmail.com
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains every player drafted in the NHL Draft from 1963 to 2022.
The data was collected from Sports Reference, then cleaned for data analysis.
Tabular data includes:
- year: Year of draft
- overall_pick: Overall pick player was drafted
- team: Team player drafted to
- player: Player drafted
- nationality: Nationality of player drafted
- position: Player position
- age: Player age
- to_year: Year draft pick played to
- amateur_team: Amateur team drafted from
- games_played: Total games played by player (non-goalie)
- goals: Total goals
- assists: Total assists
- points: Total points
- plus_minus: Plus minus of player
- penalties_minutes: Penalties in minutes
- goalie_games_played: Goalie games played
- goalie_wins
- goalie_losses
- goalie_ties_overtime: Ties plus overtime/shootout losses
- save_percentage
- goals_against_average
- point_shares