19 datasets found

IMDB Movies Analysis - SQL
kaggle.com
zip
Updated Feb 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gaurav B R (2023). IMDB Movies Analysis - SQL [Dataset]. https://www.kaggle.com/datasets/gauravbr/imdb-movies-data-erd
Explore at:
zip(3818401 bytes)Available download formats
Dataset updated
Feb 21, 2023
Authors
Gaurav B R
Description
SQL IMDB Movies Analysis for RSVP (Film Production Company)

RSVP Movies is an Indian film production company which has produced many super-hit movies. They have usually released movies for the Indian audience but for their next project, they are planning to release a movie for the global audience in 2022.

The production company wants to plan their every move analytically based on data. We have taken the last three years IMDB movies data and carried out the analysis using SQL. We have analysed the data set and drew meaningful insights that could help them start their new project.

For our convenience, the entire analytics process has been divided into four segments, where each segment leads to significant insights from different combinations of tables. The questions in each segment with business objectives are written in the script given below. We have written the solution code below every question.
Nvidia Database
kaggle.com
zip
Updated Jan 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ajay Tom (2025). Nvidia Database [Dataset]. https://www.kaggle.com/datasets/ajayt0m/nvidia-database
Explore at:
zip(8712 bytes)Available download formats
Dataset updated
Jan 30, 2025
Authors
Ajay Tom
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This is a beginner-friendly SQLite database designed to help users practice SQL and relational database concepts. The dataset represents a basic business model inspired by NVIDIA and includes interconnected tables covering essential aspects like products, customers, sales, suppliers, employees, and projects. It's perfect for anyone new to SQL or data analytics who wants to learn and experiment with structured data.

Tables and Their Contents:

Products:

Includes details of 15 products (e.g., GPUs, AI accelerators). Attributes: product_id, product_name, category, release_date, price.

Customers:

Lists 20 fictional customers with their industry and contact information. Attributes: customer_id, customer_name, industry, contact_email, contact_phone.

Sales:

Contains 100 sales records tied to products and customers. Attributes: sale_id, product_id, customer_id, sale_date, region, quantity_sold, revenue.

Suppliers:

Features 50 suppliers and the materials they provide. Attributes: supplier_id, supplier_name, material_supplied, contact_email.

Supply Chain:

Tracks materials supplied to produce products, proportional to sales. Attributes: supply_chain_id, supplier_id, product_id, supply_date, quantity_supplied.

Departments:

Lists 5 departments within the business. Attributes: department_id, department_name, location.

Employees:

Contains data on 30 employees and their roles in different departments. Attributes: employee_id, first_name, last_name, department_id, hire_date, salary.

Projects:

Describes 10 projects handled by different departments. Attributes: project_id, project_name, department_id, start_date, end_date, budget.

Why Use This Dataset?

Perfect for Beginners: The dataset is simple and easy to understand.

Interconnected Tables: Provides a basic introduction to relational database concepts like joins and foreign keys.

SQL Practice: Run basic queries, filter data, and perform simple aggregations or calculations.

Learning Tool: Great for small projects and understanding business datasets.

Potential Use Cases:

Practice SQL queries (SELECT, INSERT, UPDATE, DELETE, JOIN).

Understand how to design and query relational databases.

Analyze basic sales and supply chain data for patterns and trends.

Learn how to use databases in analytics tools like Excel, Power BI, or Tableau.

Data Size:

Number of Tables: 8 Total Rows: Around 230 across all tables, ensuring quick queries and easy exploration.
SQL Integrity Journey: Unleashing Data Constraints
kaggle.com
zip
Updated Oct 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Radha Gandhi (2023). SQL Integrity Journey: Unleashing Data Constraints [Dataset]. https://www.kaggle.com/datasets/radhagandhi/sql-integrity-journey-unleashing-data-constraints
Explore at:
zip(13817 bytes)Available download formats
Dataset updated
Oct 9, 2023
Authors
Radha Gandhi
Description
**Title: **Practical Exploration of SQL Constraints: Building a Foundation in Data Integrity Introduction: Welcome to my Data Analysis project, where I focus on mastering SQL constraints—a pivotal aspect of database management. This project centers on hands-on experience with SQL's Data Definition Language (DDL) commands, emphasizing constraints such as PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, and DEFAULT. In this project, I aim to demonstrate my foundational understanding of enforcing data integrity and maintaining a structured database environment. Purpose: The primary purpose of this project is to showcase my proficiency in implementing and managing SQL constraints for robust data governance. By delving into the realm of constraints, you'll gain insights into my SQL skills and how I utilize constraints to ensure data accuracy, consistency, and reliability within relational databases. What to Expect: Within this project, you will find a series of projects that focus on the implementation and utilization of SQL constraints. These projects highlight my command over the following key constraint types: NOT NULL: The NOT NULL constraint is crucial for ensuring the presence of essential data in a column. PRIMARY KEY: Ensuring unique identification of records for data integrity. FOREIGN KEY: Establishing relationships between tables to maintain referential integrity. UNIQUE: Guaranteeing the uniqueness of values within specified columns. CHECK: Implementing custom conditions to validate data entries. DEFAULT: Setting default values for columns to enhance data reliability. Each exercise within this project is accompanied by clear and concise SQL scripts, explanations of the intended outcomes, and practical insights into the application of these constraints. My goal is to showcase how SQL constraints serve as crucial tools for creating a structured and dependable database foundation. I invite you to explore these projects in detail, where I provide hands-on examples that highlight the importance and utility of SQL constraints. Together, these projects underscore my commitment to upholding data quality, ensuring data accuracy, and harnessing the power of SQL constraints for informed decision-making in data analysis. 3.1 CONSTRAINT - ENFORCING NOT NULL CONSTRAINT WHILE CREATING NEW TABLE. 3.2 CONSTRAINT- ENFORCE NOT NULL CONSTRAINT ON EXISTING COLUMN. 3.3 CONSTRAINT - ENFORCING PRIMARY KEY CONSTRAINT WHILE CREATING A NEW TABLE. 3.4 CONSTRAINT - ENFORCE PRIMARY KEY CONSTRAINT ON EXISTING COLUMN. 3.5 CONSTRAINT - ENFORCING FOREIGN KEY CONSTRAINT WHILE CREATING NEW TABLE. 3.6 CONSTRAINT - ENFORCE FOREIGN KEY CONSTRAINT ON EXISTING COLUMN. 3.7CONSTRAINT - ENFORCING UNIQUE CONSTRAINTS WHILE CREATING A NEW TABLE. 3.8 CONSTRAINT - ENFORCING UNIQUE CONSTRAINT IN EXISTING TABLE. 3.9 CONSTRAINT - ENFORCING CHECK CONSTRAINT IN NEW TABLE. 3.10 CONSTRAINT - ENFORCING CHECK CONSTRAINT IN THE EXISTING TABLE. 3.11 CONSTRAINT - ENFORCING DEFAULT CONSTRAINT IN THE NEW TABLE. 3.12 CONSTRAINT - ENFORCING DEFAULT CONSTRAINT IN THE EXISTING TABLE.
BookMyShow-SQL-Data-Analysis
kaggle.com
Updated May 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Soumendu Ray (2025). BookMyShow-SQL-Data-Analysis [Dataset]. https://www.kaggle.com/datasets/soumenduray99/bookmyshow-sql-data-analysis
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 6, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Soumendu Ray
Description
🎟️ BookMyShow SQL Data Analysis 🎯 Objective This project leverages SQL-based analysis to gain actionable insights into user engagement, movie performance, theater efficiency, payment systems, and customer satisfaction on the BookMyShow platform. The goal is to enhance platform performance, boost revenue, and optimize user experience through data-driven strategies.

📊 Key Analysis Areas 1. 👥 User Behavior & Engagement Identify most active users and repeat customers Track unique monthly users Analyze peak booking times and average tickets per user Drive engagement strategies and boost customer retention 2. 🎬 Movie Performance Analysis Highlight top-rated and most booked movies Analyze popular languages and high-revenue genres Study average occupancy rates Focus marketing on high-performing genres and content 3. 🏢 Theater & Show Performance Pinpoint theaters with highest/lowest bookings Evaluate popular show timings Measure theater-wise revenue contribution and occupancy Improve theater scheduling and resource allocation 4. 💵 Booking & Revenue Insights Track total revenue, top spenders, and monthly booking patterns Discover most used payment methods Calculate average price per booking and bookings per user Optimize revenue generation and spending strategies 5. 🪑 Seat Utilization & Pricing Strategy Identify most booked seat types and their revenue impact Analyze seat pricing variations and price elasticity Align pricing strategy with demand patterns for higher revenue 6. ✅❌ Payment & Transaction Analysis Distinguish successful vs. failed transactions Track refund frequency and payment delays Evaluate revenue lost due to failures Enhance payment processing systems 7. ⭐ User Reviews & Sentiment Analysis Measure average ratings per movie Identify top and lowest-rated content Analyze review volume and sentiment trends Leverage feedback to refine content offerings 🧰 Tech Stack Query Language: SQL (MySQL/PostgreSQL) Database Tools: DBeaver, pgAdmin, or any SQL IDE Visualization (Optional): Power BI / Tableau for presenting insights Version Control: Git & GitHub
Wikipedia SQLITE Portable DB, Huge 5M+ Rows
kaggle.com
zip
Updated Jun 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
christernyc (2024). Wikipedia SQLITE Portable DB, Huge 5M+ Rows [Dataset]. https://www.kaggle.com/datasets/christernyc/wikipedia-sqlite-portable-db-huge-5m-rows/code
Explore at:
zip(6064169983 bytes)Available download formats
Dataset updated
Jun 29, 2024
Authors
christernyc
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.

I am not affiliated or partnered with the Kensho in any way, just really like the dataset for giving my agents to query easily.

Key Features:

Contains over 5 million rows of data from English Wikipedia and Wikidata Stored in a portable SQLite database format for easy integration and querying Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base Ideal for NLP tasks, machine learning, data analysis, and research projects

The database consists of four main tables:

items: Contains information about Wikipedia items, including labels and descriptions

properties: Stores details about Wikidata properties, such as labels and descriptions

pages: Provides metadata for Wikipedia pages, including page IDs, item IDs, titles, and view counts

link_annotated_text: Contains the link-annotated text of Wikipedia pages, divided into sections

This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license. Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes. By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.

https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data

Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings

Usage with LIKE queries: ``` import aiosqlite import asyncio

class KenshoDatasetQuery: def init(self, db_file): self.db_file = db_file

async def _aenter_(self): self.conn = await aiosqlite.connect(self.db_file) return self async def _aexit_(self, exc_type, exc_val, exc_tb): await self.conn.close() async def search_pages_by_title(self, title): query = """ SELECT pages.page_id, pages.item_id, pages.title, pages.views, items.labels AS item_labels, items.description AS item_description, link_annotated_text.sections FROM pages JOIN items ON pages.item_id = items.id JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id WHERE pages.title LIKE ? """ async with self.conn.execute(query, (f"%{title}%",)) as cursor: return await cursor.fetchall() async def search_items_by_label_or_description(self, keyword): query = """ SELECT id, labels, description FROM items WHERE labels LIKE ? OR description LIKE ? """ async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor: return await cursor.fetchall() async def search_items_by_label(self, label): query = """ SELECT id, labels, description FROM items WHERE labels LIKE ? """ async with self.conn.execute(query, (f"%{label}%",)) as cursor: return await cursor.fetchall() async def search_properties_by_label_or_desc...
Healthcare Fraud Detection Dataset
kaggle.com
zip
Updated Mar 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vishal Jaiswal (2025). Healthcare Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/jaiswalmagic1/healthcare-fraud-detection-dataset
Explore at:
zip(10427537 bytes)Available download formats
Dataset updated
Mar 6, 2025
Authors
Vishal Jaiswal
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains comprehensive synthetic healthcare data designed for fraud detection analysis. It includes information on patients, healthcare providers, insurance claims, and payments. The dataset is structured to mimic real-world healthcare transactions, where fraudulent activities such as false claims, overbilling, and duplicate charges can be identified through advanced analytics.

The dataset is suitable for practicing SQL queries, exploratory data analysis (EDA), machine learning for fraud detection, and visualization techniques. It is designed to help data analysts and data scientists develop and refine their analytical skills in the healthcare insurance domain.

Dataset Overview The dataset consists of four CSV files:

Patients Data (patients.csv)

Contains demographic details of patients, such as age, gender, insurance type, and location. Can be used to analyze patient demographics and healthcare usage patterns. Providers Data (providers.csv)

Contains information about healthcare providers, including provider ID, specialty, location, and associated hospital.

Useful for identifying fraudulent claims linked to specific providers or hospitals. Claims Data (claims.csv)

Contains records of insurance claims made by patients, including diagnosis codes, treatment details, provider ID, and claim amount.

Can be analyzed for suspicious patterns, such as excessive claims from a single provider or duplicate claims for the same patient.

Payments Data (payments.csv) Contains details of claim payments made by insurance companies, including payment amount, claim ID, and reimbursement status.

Helps in detecting discrepancies between claims and actual reimbursements. Possible Analysis Ideas

This dataset allows for multiple analysis approaches, including but not limited to:

🔹 Fraud Detection: Identify patterns in claims data to detect fraudulent activities (e.g., excessive billing, duplicate claims). 🔹 Provider Behavior Analysis: Analyze providers who have an unusually high claim volume or high rejection rates. 🔹 Payment Trends: Compare claims vs. payments to find irregularities in reimbursement patterns. 🔹 Patient Demographics & Utilization: Explore which patient groups are more likely to file claims and receive reimbursements. 🔹 SQL Query Practice: Perform advanced SQL queries, including joins, aggregations, window functions, and subqueries, to extract insights from the data.

Use Cases Practicing SQL queries for job interviews and real-world projects. Learning data cleaning, data wrangling, and feature engineering for healthcare analytics. Applying machine learning techniques for fraud detection. Gaining insights into the healthcare insurance domain and its challenges.

License & Usage License: CC0 Public Domain (Free to use for any purpose).

Attribution: Not required but appreciated. Intended Use: This dataset is for educational and research purposes only.

This dataset is an excellent resource for aspiring data analysts, data scientists, and SQL learners who want to gain hands-on experience in healthcare fraud detection.
SQLite Sakila Sample Database
kaggle.com
zip
Updated Mar 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Atanas Kanev (2021). SQLite Sakila Sample Database [Dataset]. https://www.kaggle.com/datasets/atanaskanev/sqlite-sakila-sample-database/code
Explore at:
zip(4495190 bytes)Available download formats
Dataset updated
Mar 14, 2021
Authors
Atanas Kanev
Description
SQLite Sakila Sample Database

Database Description

The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/

Sakila for SQLite is a part of the sakila-sample-database-ports project intended to provide ported versions of the original MySQL database for other database systems, including:

Oracle

SQL Server

SQLIte

Interbase/Firebird

Microsoft Access

Sakila for SQLite is a port of the Sakila example database available for MySQL, which was originally developed by Mike Hillyer of the MySQL AB documentation team. This project is designed to help database administrators to decide which database to use for development of new products The user can run the same SQL against different kind of databases and compare the performance

License: BSD Copyright DB Software Laboratory http://www.etl-tools.com

Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html

Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/

Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db

https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila

Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/

Files Description

The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:

sqlite> .open sqlite-sakila.db # creates the .db file sqlite> .read sqlite-sakila-schema.sql # creates the database schema sqlite> .read sqlite-sakila-insert-data.sql # inserts the data

Therefore, the sqlite-sakila.db file can be directly loaded into SQLite3 and queries can be directly executed. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: Data about the film_text table is not provided in the script files, thus the film_text table is empty. Instead the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
Superstore Snowflake Schema Modeling Dataset
kaggle.com
zip
Updated Oct 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chik0di (2025). Superstore Snowflake Schema Modeling Dataset [Dataset]. https://www.kaggle.com/datasets/chik0di/superstore-snowflake-schema-modeling-dataset
Explore at:
zip(474167 bytes)Available download formats
Dataset updated
Oct 30, 2025
Authors
Chik0di
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset represents a Snowflake Schema model built from the popular Tableau Superstore dataset which exists primarily in a denormalized (flat) format.

This version is fully structured into fact and dimension tables, making it ready for data warehouse design, SQL analytics, and BI visualization projects.

The dataset was modeled to demonstrate dimensional modeling best practices, showing how the original flat Superstore data can be normalized into related dimensions and a central fact table.

Use this dataset to: - Practice SQL joins and schema design - Build ETL pipelines or dbt models - Design Power BI dashboards - Learn data warehouse normalization (3NF → Snowflake) concepts - Simulate enterprise data warehouse reporting environments

I’m open to suggestions or improvements from the community — feel free to share ideas on additional dimensions, measures, or transformations that could improve and make this dataset even more useful for learning and analysis.

Transformation was done using dbt, check out the models and the entire project.
Hospital Management Dataset
kaggle.com
zip
Updated May 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kanak Baghel (2025). Hospital Management Dataset [Dataset]. https://www.kaggle.com/datasets/kanakbaghel/hospital-management-dataset/data
Explore at:
zip(11375 bytes)Available download formats
Dataset updated
May 30, 2025
Authors
Kanak Baghel
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.

Dataset Overview

This dataset includes five CSV files:

patients.csv – Patient demographics, contact details, registration info, and insurance data

doctors.csv – Doctor profiles with specializations, experience, and contact information

appointments.csv – Appointment dates, times, visit reasons, and statuses

treatments.csv – Treatment types, descriptions, dates, and associated costs

billing.csv – Billing amounts, payment methods, and status linked to treatments

📁 Files & Column Descriptions

** patients.csv**

Contains patient demographic and registration details.

Column Description

patient_id -> Unique ID for each patient first_name -> Patient's first name last_name -> Patient's last name gender -> Gender (M/F) date_of_birth -> Date of birth contact_number -> Phone number address -> Address of the patient registration_date -> Date of first registration at the hospital insurance_provider -> Insurance company name insurance_number -> Policy number email -> Email address

** doctors.csv**

Details about the doctors working in the hospital.

Column Description

doctor_id -> Unique ID for each doctor first_name -> Doctor's first name last_name -> Doctor's last name specialization -> Medical field of expertise phone_number -> Contact number years_experience -> Total years of experience hospital_branch -> Branch of hospital where doctor is based email -> Official email address

appointments.csv

Records of scheduled and completed patient appointments.

Column Description

appointment_id -> Unique appointment ID patient_id -> ID of the patient doctor_id -> ID of the attending doctor appointment_date -> Date of the appointment appointment_time -> Time of the appointment reason_for_visit -> Purpose of visit (e.g., checkup) status -> Status (Scheduled, Completed, Cancelled)

treatments.csv

Information about the treatments given during appointments.

Column Description

treatment_id -> Unique ID for each treatment appointment_id -> Associated appointment ID treatment_type -> Type of treatment (e.g., MRI, X-ray) description -> Notes or procedure details cost -> Cost of treatment treatment_date -> Date when treatment was given

** billing.csv**

Billing and payment details for treatments.

Column Description

bill_id -> Unique billing ID patient_id -> ID of the billed patient treatment_id -> ID of the related treatment bill_date -> Date of billing amount -> Total amount billed payment_method -> Mode of payment (Cash, Card, Insurance) payment_status -> Status of payment (Paid, Pending, Failed)

Possible Use Cases

SQL queries and relational database design

Exploratory data analysis (EDA) and dashboarding

Machine learning projects (e.g., cost prediction, no-show analysis)

Feature engineering and data cleaning practice

End-to-end healthcare analytics workflows

Recommended Tools & Resources

SQL (joins, filters, window functions)

Pandas and Matplotlib/Seaborn for EDA

Scikit-learn for ML models

Pandas Profiling for automated EDA

Plotly for interactive visualizations

Please Note that :

All data is synthetically generated for educational and project use. No real patient information is included.

If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
Logistics Operations Database
kaggle.com
zip
Updated Nov 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yogape Rodriguez (2025). Logistics Operations Database [Dataset]. https://www.kaggle.com/datasets/yogape/logistics-operations-database
Explore at:
zip(15059576 bytes)Available download formats
Dataset updated
Nov 23, 2025
Authors
Yogape Rodriguez
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Kaggle Dataset: Synthetic Logistics Operations Database (2022-2024)

About this Dataset

What's Inside

A complete operational database from a fictional Class 8 trucking company spanning three years. This isn't scraped web data or simplified tutorial content—it's a realistic simulation built from 12 years of real-world logistics experience, designed specifically for analysts transitioning into supply chain and transportation domains.

The dataset contains 85,000+ records across 14 interconnected tables covering everything from driver assignments and fuel purchases to maintenance schedules and delivery performance. Each table maintains proper foreign key relationships, making this ideal for practicing complex SQL queries, building data pipelines, or developing operational dashboards.

Who This Is For

SQL Learners: Master window functions, CTEs, and multi-table JOINs using realistic business scenarios rather than contrived examples.

Data Analysts: Build portfolio projects that demonstrate understanding of operational metrics: cost-per-mile analysis, fleet utilization optimization, driver performance scorecards.

Aspiring Supply Chain Analysts: Work with authentic logistics data patterns—seasonal freight volumes, equipment utilization rates, route profitability calculations—without NDA restrictions.

Data Science Students: Develop predictive models for maintenance scheduling, driver retention, or route optimization using time-series data with actual business context.

Career Changers: If you're moving from operations into analytics (like the dataset creator), this provides a bridge—your domain knowledge becomes a competitive advantage rather than a gap to explain.

Why This Dataset Exists

Most logistics datasets are either proprietary (unavailable) or overly simplified (unrealistic). This fills the gap: operational complexity without confidentiality concerns. The data reflects real industry patterns:

Fuel prices track the 2022 diesel spike and 2023-2024 decline

Driver turnover sits at 15% annually (industry standard)

Equipment utilization averages 65% (typical for dry van operations)

On-time delivery performance ranges 85-95% (realistic service levels)

Maintenance intervals follow Class 8 PM schedules

Dataset Structure

Core Entities (Reference Tables): - Drivers (150 records) - Demographics, employment history, CDL info - Trucks (120 records) - Fleet specs, acquisition dates, status - Trailers (180 records) - Equipment types, current assignments - Customers (200 records) - Shipper accounts, contract terms, revenue potential - Facilities (50 records) - Terminals and warehouses with geocoordinates - Routes (60+ records) - City pairs with distances and rate structures

Operational Transactions: - Loads (57,000+ records) - Shipment details, revenue, booking type - Trips (57,000+ records) - Driver-truck assignments, actual performance - Fuel Purchases (131,000+ records) - Transaction-level data with pricing - Maintenance Records (6,500+ records) - Service history, costs, downtime - Delivery Events (114,000+ records) - Pickup/delivery timestamps, detention - Safety Incidents (114 records) - Accidents, violations, claims

Aggregated Analytics: - Driver Monthly Metrics (5,400+ records) - Performance summaries - Truck Utilization Metrics (3,800+ records) - Equipment efficiency

Key Features

Temporal Coverage: January 2022 through December 2024 (3 years)

Geographic Scope: National operations across 25+ major US cities

Realistic Patterns: - Seasonal freight fluctuations (Q4 peaks) - Historical fuel price accuracy - Equipment lifecycle modeling - Driver retention dynamics - Service level variations

Data Quality: - Complete foreign key integrity - No orphaned records - Intentional 2% null rate in driver/truck assignments (reflects reality) - All timestamps properly sequenced - Financial calculations verified

Use Case Examples

Business Intelligence: Create executive dashboards showing revenue per truck, cost per mile, driver efficiency rankings, maintenance spend by equipment age, customer concentration risk.

Predictive Analytics: Build models forecasting equipment failures based on maintenance history, predict driver turnover using performance metrics, estimate route profitability for new lanes.

Operations Optimization: Analyze route efficiency, identify underutilized assets, optimize maintenance scheduling, calculate ideal fleet size, evaluate driver-to-truck ratios.

SQL Mastery: Practice window functions for running totals and rankings, write complex JOINs across 6+ tables, implement CTEs for hierarchical queries, perform cohort analysis on driver retention.

Sample Questions to Explore

Which routes generate the highest profit margin after fuel costs?

How does driver tenure correlate with fuel ef...
BigQuery Fintech Dataset
kaggle.com
Updated Aug 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mustafa Keser (2024). BigQuery Fintech Dataset [Dataset]. https://www.kaggle.com/datasets/mustafakeser4/bigquery-fintech-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 17, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mustafa Keser
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset: cloud-training-demos.fintech

This dataset, hosted on BigQuery, is designed for financial technology (fintech) training and analysis. It comprises six interconnected tables, each providing detailed insights into various aspects of customer loans, loan purposes, and regional distributions. The dataset is ideal for practicing SQL queries, building data models, and conducting financial analytics.

Tables:

customer:
Contains records of individual customers, including demographic details and unique customer IDs. This table serves as a primary reference for analyzing customer behavior and loan distribution.

loan:
Includes detailed information about each loan issued, such as the loan amount, interest rate, and tenure. The table is crucial for analyzing lending patterns and financial outcomes.

loan_count_by_year:
Provides aggregated loan data by year, offering insights into yearly lending trends. This table helps in understanding the temporal dynamics of loan issuance.

loan_purposes:
Lists various reasons or purposes for which loans were issued, along with corresponding loan counts. This data can be used to analyze customer needs and market demands.

loan_with_region:
Combines loan data with regional information, allowing for geographical analysis of lending activities. This table is key for regional market analysis and understanding how loan distribution varies across different areas.

state_region:
Maps state names to their respective regions, enabling a more granular geographical analysis when combined with other tables in the dataset.

Use Cases:

Customer Segmentation: Analyze customer data to identify distinct segments based on demographics and loan behaviors.

Loan Analysis: Explore loan issuance patterns, interest rates, and purposes to uncover trends and insights.

Regional Analysis: Combine loan and region data to understand how loan distributions vary by geography.

Temporal Trends: Utilize the loan_count_by_year table to observe how lending patterns evolve over time.

This dataset is ideal for those looking to enhance their skills in SQL, financial data analysis, and BigQuery, providing a comprehensive foundation for fintech-related projects and case studies.
Model Car - Mint Classics
kaggle.com
zip
Updated Apr 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gaston Saracusti (2024). Model Car - Mint Classics [Dataset]. https://www.kaggle.com/datasets/gastonsaracusti/model-car-mint-classics
Explore at:
zip(26650 bytes)Available download formats
Dataset updated
Apr 29, 2024
Authors
Gaston Saracusti
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Mint Classics Company, a retailer of classic model cars and other vehicles, is looking at closing one of their storage facilities.

To support a data-based business decision, they are looking for suggestions and recommendations for reorganizing or reducing inventory, while still maintaining timely service to their customers. For example, they would like to be able to ship a product to a customer within 24 hours of the order being placed.

As a data analyst, you have been asked to use MySQL Workbench to familiarize yourself with the general business by examining the current data. You will be provided with a data model and sample data tables to review. You will then need to isolate and identify those parts of the data that could be useful in deciding how to reduce inventory. You will write queries to answer questions like these:

1) Where are items stored and if they were rearranged, could a warehouse be eliminated?

2) How are inventory numbers related to sales figures? Do the inventory counts seem appropriate for each item?

3) Are we storing items that are not moving? Are any items candidates for being dropped from the product line?

The answers to questions like those should help you to formulate suggestions and recommendations for reducing inventory with the goal of closing one of the storage facilities.

Project Objectives

Explore products currently in inventory.

Determine important factors that may influence inventory reorganization/reduction.

Provide analytic insights and data-driven recommendations.

Your Challenge

Your challenge will be to conduct an exploratory data analysis to investigate if there are any patterns or themes that may influence the reduction or reorganization of inventory in the Mint Classics storage facilities. To do this, you will import the database and then analyze data. You will also pose questions, and seek to answer them meaningfully using SQL queries to retrieve data from the database provided.

In this project, we'll use the fictional Mint Classics relational database and a relational data model. Both will be provided.

After you perform your analysis, you will share your findings.
Kimia Farma: Performance Analysis 2020-2023
kaggle.com
zip
Updated Feb 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anggun Dwi Lestari (2025). Kimia Farma: Performance Analysis 2020-2023 [Dataset]. https://www.kaggle.com/datasets/anggundwilestari/kimia-farma-performance-analysis-2020-2023
Explore at:
zip(30284703 bytes)Available download formats
Dataset updated
Feb 27, 2025
Authors
Anggun Dwi Lestari
Description
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F19062145%2F025ccf521f62db512b4a98edd0b3508a%2FKimia_Farma_Dashboard.jpg?generation=1748428094441761&alt=media" alt="">This project analyzes Kimia Farma's performance from 2020 to 2023 using Google Looker Studio. The analysis is based on a pre-processed dataset stored in BigQuery, which serves as the data source for the dashboard.

Project Scope

The dashboard is designed to provide insights into branch performance, sales trends, customer ratings, and profitability. The development is ongoing, with multiple pages planned for a more in-depth analysis.

Current Progress

✅ The first page of the dashboard is completed
✅ A sample dashboard file is available on Kaggle
🔄 Development will continue with additional pages

Dataset Overview

The dataset consists of transaction records from Kimia Farma branches across different cities and provinces. Below are the key columns used in the analysis: - transaction_id: Transaction ID code - date: Transaction date - branch_id: Kimia Farma branch ID code - branch_name: Kimia Farma branch name - kota: City of the Kimia Farma branch - provinsi: Province of the Kimia Farma branch - rating_cabang: Customer rating of the Kimia Farma branch - customer_name: Name of the customer who made the transaction - product_id: Product ID code - product_name: Name of the medicine - actual_price: Price of the medicine - discount_percentage: Discount percentage applied to the medicine - persentase_gross_laba: Gross profit percentage based on the following conditions:
Price ≤ Rp 50,000 → 10% profit
Price > Rp 50,000 - 100,000 → 15% profit
Price > Rp 100,000 - 300,000 → 20% profit
Price > Rp 300,000 - 500,000 → 25% profit
Price > Rp 500,000 → 30% profit
- nett_sales: Price after discount - nett_profit: Profit earned by Kimia Farma - rating_transaksi: Customer rating of the transaction

Files Provided

📌 kimia farma_query.txt – Contains SQL queries used for data analysis in Looker Studio
📌 kimia farma_analysis_table.csv – Preprocessed dataset ready for import and analysis

📢 Published on : My LinkedIn
Supply Chain DataSet
kaggle.com
zip
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amir Motefaker (2023). Supply Chain DataSet [Dataset]. https://www.kaggle.com/datasets/amirmotefaker/supply-chain-dataset
Explore at:
zip(9340 bytes)Available download formats
Dataset updated
Jun 1, 2023
Authors
Amir Motefaker
Description
Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.
Cloud Carbon Emissions Dataset
kaggle.com
zip
Updated Sep 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nidhi Suryavanshi (2025). Cloud Carbon Emissions Dataset [Dataset]. https://www.kaggle.com/datasets/nidhis4444/cloud-carbon-emissions-dataset
Explore at:
zip(36611 bytes)Available download formats
Dataset updated
Sep 23, 2025
Authors
Nidhi Suryavanshi
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains a synthetic simulation of cloud resource usage and carbon emissions, designed for experimentation, analysis, and forecasting in sustainability and data engineering projects.

Included Tables: - projects → Metadata about projects/teams. - services → Metadata about cloud services (Compute, Storage, AI, etc.). - emission_factors → Regional grid carbon intensity (gCO₂ per kWh). - service_energy_coefficients → Conversion rates from usage units to kWh. - daily_usage → Raw service usage (per project × service × region × day). - daily_emissions → Carbon emissions derived from usage × regional emission factors. - service_cost_coefficients → Conversion rates from usage units to cost (USD per unit).
- daily_cost_emissions → Integrated fact table combining usage, energy, cost, and emissions for analysis.

Features: - Simulated seasonality (weekend dips/spikes, holiday surges, quarter-end growth). - Regional variations in carbon intensity (e.g., coal-heavy vs renewable grids). - Multiple projects and services for multi-dimensional analysis. - Directly importable into BigQuery for analytics & forecasting.

Use Cases: Explore sustainability analytics at scale. Build carbon footprint dashboards. Run AI/ML forecasting on emissions data. Practice SQL, data modeling, and visualization.

⚠️ Note: All data is synthetic and created for educational/demo purposes. It does not represent actual cloud provider emissions.
Blinkit dataset
kaggle.com
zip
Updated Jul 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
mukesh gadri (2024). Blinkit dataset [Dataset]. https://www.kaggle.com/datasets/mukeshgadri/blinkit-dataset
Explore at:
zip(695160 bytes)Available download formats
Dataset updated
Jul 18, 2024
Authors
mukesh gadri
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
In the case study titled "Blinkit: Grocery Product Analysis," a dataset called 'Grocery Sales' contains 12 columns with information on sales of grocery items across different outlets. Using Tableau, you as a data analyst can uncover customer behavior insights, track sales trends, and gather feedback. These insights will drive operational improvements, enhance customer satisfaction, and optimize product offerings and store layout. Tableau enables data-driven decision-making for positive outcomes at Blinkit.

The table Grocery Sales is a .CSV file and has the following columns, details of which are as follows:

• Item_Identifier: A unique ID for each product in the dataset. • Item_Weight: The weight of the product. • Item_Fat_Content: Indicates whether the product is low fat or not. • Item_Visibility: The percentage of the total display area in the store that is allocated to the specific product. • Item_Type: The category or type of product. • Item_MRP: The maximum retail price (list price) of the product. • Outlet_Identifier: A unique ID for each store in the dataset. • Outlet_Establishment_Year: The year in which the store was established. • Outlet_Size: The size of the store in terms of ground area covered. • Outlet_Location_Type: The type of city or region in which the store is located. • Outlet_Type: Indicates whether the store is a grocery store or a supermarket. • Item_Outlet_Sales: The sales of the product in the particular store. This is the outcome variable that we want to predict.
ascension
kaggle.com
zip
Updated Nov 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
cmumf93 (2025). ascension [Dataset]. https://www.kaggle.com/datasets/cmumford1993/ascension
Explore at:
zip(78996 bytes)Available download formats
Dataset updated
Nov 17, 2025
Authors
cmumf93
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This is a synthetic dataset inspired by the merchandise and supply-chain operations of a Christian publishing company. It was created to practice:

Product & channel performance analysis

Supply-chain and vendor risk assessment

Inventory and backorder monitoring

Basic forecasting and scenario planning

The data spans 2025-01-01 to 2025-06-30 and includes 10 products (studies, devotionals, rosaries, journals, and digital bundles) sold across four channels (Website, Parish Bulk, Amazon, and Events) in four US regions (Northeast, Midwest, South, West).

Dataset summary

Rows: 5,435 daily product–channel–region records

Products: 10

Channels: Website, Parish Bulk, Amazon, Event

Regions: Northeast, Midwest, South, West

Vendors: Multiple printers and vendors with different lead times and risk profiles

Each row describes the performance of a single product on a given date in a given channel, along with inventory and vendor information that can be used for operational risk analysis.

Columns

date – Calendar date for the record (YYYY-MM-DD).

product_id – Short ID for the product (e.g., BIBLE-STUDY-101).

product_name – Human-readable product name (e.g., Foundations Bible Study).

product_category – High-level category (Adult Study, Seasonal, Sacrament Prep, etc.).

format – Physical or Digital format.

channel – Sales channel (Website, Parish Bulk, Amazon, Event).

region – US region where the sale occurred (Northeast, Midwest, South, West).

vendor – Primary printer or vendor responsible for fulfilling that product.

units_sold – Number of units sold for that product/date/channel/region.

unit_price – Selling price per unit (USD).

revenue – Total revenue = units_sold * unit_price.

cogs_per_unit – Cost of goods sold per unit (approximate production/fulfillment cost).

gross_margin – Revenue minus total COGS for that row.

inventory_start – On-hand inventory at the start of the day.

inventory_end – On-hand inventory at the end of the day after sales.

backorder_flag – True if demand exceeded inventory and created a backorder, otherwise False.

lead_time_days – Typical replenishment lead time in days for that product/vendor combination.

What you can do with this dataset

This dataset is designed for:

Product & channel profitability

Rank products by total profit or margin.

Compare profitability across channels and regions.

Supply-chain & vendor risk

Identify products with long lead times and frequent backorders.

Flag higher-risk vendors (e.g., long lead times, tight inventory).

Inventory analytics

Track when inventory gets tight.

Explore safety stock ideas using inventory_start, inventory_end, and backorder_flag.

Forecasting & scenario planning

Build time-series forecasts of units sold or revenue.

Simulate what happens if one vendor fails or lead times increase.

Learning & practice

Practice SQL, Python, or R data analysis.

Build dashboards (Tableau, Power BI, etc.) or case-study style projects for a product or data-analytics portfolio.

Important notes

This is not real Ascension data; it is fully synthetic and safe to use publicly.

The structure was designed to resemble realistic publishing/merchandise operations, but the exact numbers and patterns were generated programmatically.

If you use this dataset in a notebook, blog post, or portfolio project, feel free to link back here so others can see how you approached the analysis.
E-Commerce Data
kaggle.com
zip
Updated Aug 17, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carrie (2017). E-Commerce Data [Dataset]. https://www.kaggle.com/datasets/carrie1/ecommerce-data
Explore at:
zip(7548686 bytes)Available download formats
Dataset updated
Aug 17, 2017
Authors
Carrie
Description
Context

Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, The UCI Machine Learning Repository has made this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found by the title "Online Retail".

Content

"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

Acknowledgements

Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

Image from stocksnap.io.

Inspiration

Analyses for this dataset could include time series, clustering, classification and more.
Railway Management System
kaggle.com
zip
Updated Oct 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
MANISH SHARMA 95 (2023). Railway Management System [Dataset]. https://www.kaggle.com/datasets/manish9569/railway-management-system
Explore at:
zip(3026 bytes)Available download formats
Dataset updated
Oct 19, 2023
Authors
MANISH SHARMA 95
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
**I shared dataset for SQL basic project of railway management system in reservation area this data helpful for basic details of reservation ticket . ** I collect data from Wikipedia, GitHub, Kaggle and other sources so make this project for basic understanding and with some moderate queries of SQL this also help in queries of SQL.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Gaurav B R (2023). IMDB Movies Analysis - SQL [Dataset]. https://www.kaggle.com/datasets/gauravbr/imdb-movies-data-erd

IMDB Movies Analysis - SQL

SQL IMDB Movies Analysis for RSVP (Film Production Company)

Explore at:

zip(3818401 bytes)Available download formats

Dataset updated

Feb 21, 2023

Authors

Gaurav B R

Description

SQL IMDB Movies Analysis for RSVP (Film Production Company)

RSVP Movies is an Indian film production company which has produced many super-hit movies. They have usually released movies for the Indian audience but for their next project, they are planning to release a movie for the global audience in 2022.

The production company wants to plan their every move analytically based on data. We have taken the last three years IMDB movies data and carried out the analysis using SQL. We have analysed the data set and drew meaningful insights that could help them start their new project.

For our convenience, the entire analytics process has been divided into four segments, where each segment leads to significant insights from different combinations of tables. The questions in each segment with business objectives are written in the script given below. We have written the solution code below every question.

Clear search

Close search

Google apps

Main menu

IMDB Movies Analysis - SQL

SQL IMDB Movies Analysis for RSVP (Film Production Company)

Nvidia Database

Tables and Their Contents:

Products:

Customers:

Sales:

Suppliers:

Supply Chain:

Departments:

Employees:

Projects:

Why Use This Dataset?

Potential Use Cases:

Data Size:

SQL Integrity Journey: Unleashing Data Constraints

BookMyShow-SQL-Data-Analysis

Wikipedia SQLITE Portable DB, Huge 5M+ Rows

Healthcare Fraud Detection Dataset

SQLite Sakila Sample Database

SQLite Sakila Sample Database

Database Description

Files Description

Superstore Snowflake Schema Modeling Dataset

Hospital Management Dataset

Logistics Operations Database

Kaggle Dataset: Synthetic Logistics Operations Database (2022-2024)

About this Dataset

What's Inside

Who This Is For

Why This Dataset Exists

Dataset Structure

Key Features

Use Case Examples

Sample Questions to Explore

BigQuery Fintech Dataset

Tables:

Use Cases:

Model Car - Mint Classics

Kimia Farma: Performance Analysis 2020-2023

Project Scope

Current Progress

Dataset Overview

Files Provided

📢 Published on : My LinkedIn

Supply Chain DataSet

Cloud Carbon Emissions Dataset

Blinkit dataset

ascension

E-Commerce Data

Context

Content

Acknowledgements

Inspiration

Railway Management System

IMDB Movies Analysis - SQL

SQL IMDB Movies Analysis for RSVP (Film Production Company)

SQL IMDB Movies Analysis for RSVP (Film Production Company)