RSVP Movies is an Indian film production company that has produced many super-hit movies. It has usually released movies for the Indian audience, but for its next project it is planning to release a movie for the global audience in 2022.
The production company wants to plan its every move analytically, based on data. We have taken the last three years' IMDb movie data and carried out the analysis using SQL, drawing meaningful insights that could help the company start its new project.
For convenience, the entire analytics process has been divided into four segments, where each segment leads to significant insights from different combinations of tables. The questions in each segment, along with their business objectives, are written in the script given below, and the solution code follows every question.
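As a flavour of the kind of query the segments work through, here is a minimal sketch; the table and column names (a movie table with a year column) are assumptions for illustration and may differ from the schema used in the project.

```sql
-- Hypothetical example: number of movies released per year.
-- Table and column names (movie, year) are assumed for illustration.
SELECT year,
       COUNT(*) AS total_movies
FROM movie
GROUP BY year
ORDER BY year;
```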
License: CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This is a beginner-friendly SQLite database designed to help users practice SQL and relational database concepts. The dataset represents a basic business model inspired by NVIDIA and includes interconnected tables covering essential aspects like products, customers, sales, suppliers, employees, and projects. It's perfect for anyone new to SQL or data analytics who wants to learn and experiment with structured data.
Includes details of 15 products (e.g., GPUs, AI accelerators). Attributes: product_id, product_name, category, release_date, price.
Lists 20 fictional customers with their industry and contact information. Attributes: customer_id, customer_name, industry, contact_email, contact_phone.
Contains 100 sales records tied to products and customers. Attributes: sale_id, product_id, customer_id, sale_date, region, quantity_sold, revenue.
Features 50 suppliers and the materials they provide. Attributes: supplier_id, supplier_name, material_supplied, contact_email.
Tracks materials supplied to produce products, proportional to sales. Attributes: supply_chain_id, supplier_id, product_id, supply_date, quantity_supplied.
Lists 5 departments within the business. Attributes: department_id, department_name, location.
Contains data on 30 employees and their roles in different departments. Attributes: employee_id, first_name, last_name, department_id, hire_date, salary.
Describes 10 projects handled by different departments. Attributes: project_id, project_name, department_id, start_date, end_date, budget.
Number of tables: 8. Total rows: around 230 across all tables, ensuring quick queries and easy exploration.
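A minimal example query against this schema; the column names come from the descriptions above, while the table names (products, sales) are assumed, since the descriptions do not state them explicitly.

```sql
-- Revenue by product category and region; table names (products, sales) are assumed.
SELECT p.category,
       s.region,
       SUM(s.revenue)       AS total_revenue,
       SUM(s.quantity_sold) AS total_units
FROM sales AS s
JOIN products AS p ON p.product_id = s.product_id
GROUP BY p.category, s.region
ORDER BY total_revenue DESC;
```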
**Title:** Practical Exploration of SQL Constraints: Building a Foundation in Data Integrity

Introduction: Welcome to my data analysis project, which focuses on mastering SQL constraints, a pivotal aspect of database management. The project centers on hands-on experience with SQL's Data Definition Language (DDL) commands, emphasizing constraints such as PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, and DEFAULT. My aim is to demonstrate a foundational understanding of enforcing data integrity and maintaining a structured database environment.

Purpose: The primary purpose of this project is to showcase proficiency in implementing and managing SQL constraints for robust data governance. By delving into constraints, you will gain insight into my SQL skills and how I use constraints to ensure data accuracy, consistency, and reliability within relational databases.

What to expect: The project contains a series of exercises focused on the implementation and use of the following key constraint types:
NOT NULL: ensuring the presence of essential data in a column.
PRIMARY KEY: ensuring unique identification of records for data integrity.
FOREIGN KEY: establishing relationships between tables to maintain referential integrity.
UNIQUE: guaranteeing the uniqueness of values within specified columns.
CHECK: implementing custom conditions to validate data entries.
DEFAULT: setting default values for columns to enhance data reliability.

Each exercise is accompanied by clear and concise SQL scripts, explanations of the intended outcomes, and practical insights into the application of these constraints. Together, the exercises show how SQL constraints serve as crucial tools for creating a structured and dependable database foundation, upholding data quality and supporting informed decision-making in data analysis.

3.1 CONSTRAINT - ENFORCING NOT NULL CONSTRAINT WHILE CREATING A NEW TABLE
3.2 CONSTRAINT - ENFORCING NOT NULL CONSTRAINT ON AN EXISTING COLUMN
3.3 CONSTRAINT - ENFORCING PRIMARY KEY CONSTRAINT WHILE CREATING A NEW TABLE
3.4 CONSTRAINT - ENFORCING PRIMARY KEY CONSTRAINT ON AN EXISTING COLUMN
3.5 CONSTRAINT - ENFORCING FOREIGN KEY CONSTRAINT WHILE CREATING A NEW TABLE
3.6 CONSTRAINT - ENFORCING FOREIGN KEY CONSTRAINT ON AN EXISTING COLUMN
3.7 CONSTRAINT - ENFORCING UNIQUE CONSTRAINT WHILE CREATING A NEW TABLE
3.8 CONSTRAINT - ENFORCING UNIQUE CONSTRAINT ON AN EXISTING TABLE
3.9 CONSTRAINT - ENFORCING CHECK CONSTRAINT IN A NEW TABLE
3.10 CONSTRAINT - ENFORCING CHECK CONSTRAINT IN AN EXISTING TABLE
3.11 CONSTRAINT - ENFORCING DEFAULT CONSTRAINT IN A NEW TABLE
3.12 CONSTRAINT - ENFORCING DEFAULT CONSTRAINT IN AN EXISTING TABLE
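As a flavour of the exercises above, here is a minimal sketch of the covered constraint types in one DDL script; the table and column names (departments, employees) are hypothetical and not taken from the project itself.

```sql
-- Hypothetical tables illustrating the constraint types covered in the exercises.
CREATE TABLE departments (
    department_id   INT PRIMARY KEY,                -- PRIMARY KEY: unique row identifier
    department_name VARCHAR(100) NOT NULL UNIQUE    -- NOT NULL + UNIQUE on the same column
);

CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,
    full_name     VARCHAR(100) NOT NULL,            -- NOT NULL: a value must be supplied
    salary        DECIMAL(10,2) CHECK (salary > 0), -- CHECK: custom validation rule
    status        VARCHAR(20) DEFAULT 'active',     -- DEFAULT: value used when none is given
    department_id INT,
    FOREIGN KEY (department_id)                     -- FOREIGN KEY: referential integrity
        REFERENCES departments (department_id)
);

-- Enforcing an additional constraint on an existing table (exact syntax varies by RDBMS).
ALTER TABLE employees
    ADD CONSTRAINT chk_salary_cap CHECK (salary < 1000000);
```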
🎟️ BookMyShow SQL Data Analysis

🎯 Objective
This project leverages SQL-based analysis to gain actionable insights into user engagement, movie performance, theater efficiency, payment systems, and customer satisfaction on the BookMyShow platform. The goal is to enhance platform performance, boost revenue, and optimize user experience through data-driven strategies.
📊 Key Analysis Areas

1. 👥 User Behavior & Engagement: identify the most active users and repeat customers, track unique monthly users, analyze peak booking times and average tickets per user, and drive engagement strategies and customer retention (see the sample query after this list).
2. 🎬 Movie Performance Analysis: highlight top-rated and most booked movies, analyze popular languages and high-revenue genres, study average occupancy rates, and focus marketing on high-performing genres and content.
3. 🏢 Theater & Show Performance: pinpoint theaters with the highest/lowest bookings, evaluate popular show timings, measure theater-wise revenue contribution and occupancy, and improve theater scheduling and resource allocation.
4. 💵 Booking & Revenue Insights: track total revenue, top spenders, and monthly booking patterns; discover the most used payment methods; calculate average price per booking and bookings per user; optimize revenue generation and spending strategies.
5. 🪑 Seat Utilization & Pricing Strategy: identify the most booked seat types and their revenue impact, analyze seat pricing variations and price elasticity, and align pricing strategy with demand patterns for higher revenue.
6. ✅❌ Payment & Transaction Analysis: distinguish successful vs. failed transactions, track refund frequency and payment delays, evaluate revenue lost due to failures, and enhance payment processing systems.
7. ⭐ User Reviews & Sentiment Analysis: measure average ratings per movie, identify top and lowest-rated content, analyze review volume and sentiment trends, and leverage feedback to refine content offerings.

🧰 Tech Stack
- Query language: SQL (MySQL/PostgreSQL)
- Database tools: DBeaver, pgAdmin, or any SQL IDE
- Visualization (optional): Power BI / Tableau for presenting insights
- Version control: Git & GitHub
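A minimal sketch of the kind of query used in the user-engagement area; the table and column names (bookings, user_id, booking_time) are assumptions, since the actual schema is not listed here.

```sql
-- Unique monthly users and average bookings per user (MySQL syntax).
-- Table and column names (bookings, user_id, booking_time) are assumed for illustration.
SELECT DATE_FORMAT(booking_time, '%Y-%m')       AS booking_month,
       COUNT(DISTINCT user_id)                  AS unique_users,
       COUNT(*) * 1.0 / COUNT(DISTINCT user_id) AS avg_bookings_per_user
FROM bookings
GROUP BY DATE_FORMAT(booking_time, '%Y-%m')
ORDER BY booking_month;
```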
License: CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.
I am not affiliated or partnered with Kensho in any way; I just really like the dataset because my agents can query it easily.
Key Features:
- Contains over 5 million rows of data from English Wikipedia and Wikidata
- Stored in a portable SQLite database format for easy integration and querying
- Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
- Ideal for NLP tasks, machine learning, data analysis, and research projects
The database consists of four main tables; judging from the sample code below, these are pages, items, properties, and link_annotated_text.
This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license. Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes. By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.
https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data
Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings
Usage with LIKE queries:

```python
import asyncio

import aiosqlite


class KenshoDatasetQuery:
    def __init__(self, db_file):
        self.db_file = db_file

    async def __aenter__(self):
        # Open the SQLite connection when entering the async context.
        self.conn = await aiosqlite.connect(self.db_file)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.conn.close()

    async def search_pages_by_title(self, title):
        # Join pages to items and the link-annotated text, filtering by page title.
        query = """
            SELECT pages.page_id, pages.item_id, pages.title, pages.views,
                   items.labels AS item_labels, items.description AS item_description,
                   link_annotated_text.sections
            FROM pages
            JOIN items ON pages.item_id = items.id
            JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
            WHERE pages.title LIKE ?
        """
        async with self.conn.execute(query, (f"%{title}%",)) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label_or_description(self, keyword):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label(self, label):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ?
        """
        async with self.conn.execute(query, (f"%{label}%",)) as cursor:
            return await cursor.fetchall()

    # async def search_properties_by_label_or_desc... (remaining methods truncated in the source)
```
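A minimal usage sketch for the class above; the database filename (kensho.db) and the search term are assumptions for illustration.

```python
import asyncio

async def main():
    # Open the SQLite file via the async context manager and run a LIKE search on page titles.
    async with KenshoDatasetQuery("kensho.db") as kq:  # filename is an assumption
        rows = await kq.search_pages_by_title("Alan Turing")
        for page_id, item_id, title, views, labels, description, sections in rows[:5]:
            print(page_id, title, views)

asyncio.run(main())
```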
License: CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains comprehensive synthetic healthcare data designed for fraud detection analysis. It includes information on patients, healthcare providers, insurance claims, and payments. The dataset is structured to mimic real-world healthcare transactions, where fraudulent activities such as false claims, overbilling, and duplicate charges can be identified through advanced analytics.
The dataset is suitable for practicing SQL queries, exploratory data analysis (EDA), machine learning for fraud detection, and visualization techniques. It is designed to help data analysts and data scientists develop and refine their analytical skills in the healthcare insurance domain.
Dataset Overview

The dataset consists of four CSV files:

Patients Data (patients.csv)
Contains demographic details of patients, such as age, gender, insurance type, and location. Can be used to analyze patient demographics and healthcare usage patterns.

Providers Data (providers.csv)
Contains information about healthcare providers, including provider ID, specialty, location, and associated hospital. Useful for identifying fraudulent claims linked to specific providers or hospitals.

Claims Data (claims.csv)
Contains records of insurance claims made by patients, including diagnosis codes, treatment details, provider ID, and claim amount. Can be analyzed for suspicious patterns, such as excessive claims from a single provider or duplicate claims for the same patient.

Payments Data (payments.csv)
Contains details of claim payments made by insurance companies, including payment amount, claim ID, and reimbursement status. Helps in detecting discrepancies between claims and actual reimbursements.

Possible Analysis Ideas
This dataset allows for multiple analysis approaches, including but not limited to:
🔹 Fraud Detection: identify patterns in claims data to detect fraudulent activities (e.g., excessive billing, duplicate claims).
🔹 Provider Behavior Analysis: analyze providers who have an unusually high claim volume or high rejection rates.
🔹 Payment Trends: compare claims vs. payments to find irregularities in reimbursement patterns.
🔹 Patient Demographics & Utilization: explore which patient groups are more likely to file claims and receive reimbursements.
🔹 SQL Query Practice: perform advanced SQL queries, including joins, aggregations, window functions, and subqueries, to extract insights from the data.
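A minimal sketch of the duplicate-claims idea above; the column names (patient_id, provider_id, diagnosis_code, claim_amount) are assumptions based on the field descriptions, so check them against claims.csv.

```sql
-- Potential duplicate claims: same patient, provider, diagnosis and amount filed more than once.
-- Column names are assumed for illustration and may differ in claims.csv.
SELECT patient_id,
       provider_id,
       diagnosis_code,
       claim_amount,
       COUNT(*) AS times_filed
FROM claims
GROUP BY patient_id, provider_id, diagnosis_code, claim_amount
HAVING COUNT(*) > 1
ORDER BY times_filed DESC;
```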
Use Cases
- Practicing SQL queries for job interviews and real-world projects.
- Learning data cleaning, data wrangling, and feature engineering for healthcare analytics.
- Applying machine learning techniques for fraud detection.
- Gaining insights into the healthcare insurance domain and its challenges.
License & Usage
- License: CC0 Public Domain (free to use for any purpose).
- Attribution: not required but appreciated.
- Intended use: this dataset is for educational and research purposes only.
This dataset is an excellent resource for aspiring data analysts, data scientists, and SQL learners who want to gain hands-on experience in healthcare fraud detection.
The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory, among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/
Sakila for SQLite is part of the sakila-sample-database-ports project, which is intended to provide ported versions of the original MySQL database for other database systems.
Sakila for SQLite is a port of the Sakila example database available for MySQL, which was originally developed by Mike Hillyer of the MySQL AB documentation team. The project is designed to help database administrators decide which database to use for the development of new products: the user can run the same SQL against different kinds of databases and compare the performance.
License: BSD Copyright DB Software Laboratory http://www.etl-tools.com
Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html
Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/
Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db
https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila
Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/
The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:
sqlite> .open sqlite-sakila.db # creates the .db file
sqlite> .read sqlite-sakila-schema.sql # creates the database schema
sqlite> .read sqlite-sakila-insert-data.sql # inserts the data
Therefore, the sqlite-sakila.db file can be directly loaded into SQLite3 and queries can be directly executed. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: Data about the film_text table is not provided in the script files, thus the film_text table is empty. Instead the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
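As a quick sanity check once sqlite-sakila.db is loaded, a query like the following returns the top customers by spend; it uses the standard Sakila customer and payment tables, so verify the column names against the ERD for this version.

```sql
-- Top 10 customers by total payment amount.
SELECT c.customer_id,
       c.first_name,
       c.last_name,
       SUM(p.amount) AS total_paid,
       COUNT(*)      AS payments
FROM payment  AS p
JOIN customer AS c ON c.customer_id = p.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name
ORDER BY total_paid DESC
LIMIT 10;
```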
License: CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset represents a Snowflake Schema model built from the popular Tableau Superstore dataset which exists primarily in a denormalized (flat) format.
This version is fully structured into fact and dimension tables, making it ready for data warehouse design, SQL analytics, and BI visualization projects.
The dataset was modeled to demonstrate dimensional modeling best practices, showing how the original flat Superstore data can be normalized into related dimensions and a central fact table.
Use this dataset to:
- Practice SQL joins and schema design (see the sketch after this list)
- Build ETL pipelines or dbt models
- Design Power BI dashboards
- Learn data warehouse normalization (3NF → Snowflake) concepts
- Simulate enterprise data warehouse reporting environments
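A minimal join sketch for the first item; the fact and dimension table names (fact_sales, dim_product, dim_customer) and key columns are hypothetical, since the actual table names in this model are not listed here.

```sql
-- Hypothetical star/snowflake join: sales by product category and customer segment.
-- Table and column names are assumptions; substitute the actual fact and dimension names.
SELECT dp.category,
       dc.segment,
       SUM(f.sales)  AS total_sales,
       SUM(f.profit) AS total_profit
FROM fact_sales   AS f
JOIN dim_product  AS dp ON dp.product_key  = f.product_key
JOIN dim_customer AS dc ON dc.customer_key = f.customer_key
GROUP BY dp.category, dc.segment
ORDER BY total_sales DESC;
```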
I’m open to suggestions or improvements from the community — feel free to share ideas on additional dimensions, measures, or transformations that could improve and make this dataset even more useful for learning and analysis.
Transformation was done using dbt; check out the models and the entire project.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.
Dataset Overview
This dataset includes five CSV files:
patients.csv – Patient demographics, contact details, registration info, and insurance data
doctors.csv – Doctor profiles with specializations, experience, and contact information
appointments.csv – Appointment dates, times, visit reasons, and statuses
treatments.csv – Treatment types, descriptions, dates, and associated costs
billing.csv – Billing amounts, payment methods, and status linked to treatments
📁 Files & Column Descriptions
**patients.csv**
Contains patient demographic and registration details.
Column -> Description
patient_id -> Unique ID for each patient
first_name -> Patient's first name
last_name -> Patient's last name
gender -> Gender (M/F)
date_of_birth -> Date of birth
contact_number -> Phone number
address -> Address of the patient
registration_date -> Date of first registration at the hospital
insurance_provider -> Insurance company name
insurance_number -> Policy number
email -> Email address
**doctors.csv**
Details about the doctors working in the hospital.
Column -> Description
doctor_id -> Unique ID for each doctor
first_name -> Doctor's first name
last_name -> Doctor's last name
specialization -> Medical field of expertise
phone_number -> Contact number
years_experience -> Total years of experience
hospital_branch -> Branch of hospital where the doctor is based
email -> Official email address
appointments.csv
Records of scheduled and completed patient appointments.
Column -> Description
appointment_id -> Unique appointment ID
patient_id -> ID of the patient
doctor_id -> ID of the attending doctor
appointment_date -> Date of the appointment
appointment_time -> Time of the appointment
reason_for_visit -> Purpose of visit (e.g., checkup)
status -> Status (Scheduled, Completed, Cancelled)
treatments.csv
Information about the treatments given during appointments.
Column -> Description
treatment_id -> Unique ID for each treatment
appointment_id -> Associated appointment ID
treatment_type -> Type of treatment (e.g., MRI, X-ray)
description -> Notes or procedure details
cost -> Cost of treatment
treatment_date -> Date when treatment was given
**billing.csv**
Billing and payment details for treatments.
Column -> Description
bill_id -> Unique billing ID
patient_id -> ID of the billed patient
treatment_id -> ID of the related treatment
bill_date -> Date of billing
amount -> Total amount billed
payment_method -> Mode of payment (Cash, Card, Insurance)
payment_status -> Status of payment (Paid, Pending, Failed)
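With the five files loaded as tables of the same names, a typical cross-table query looks like this (a minimal sketch using the columns listed above):

```sql
-- Total billed and paid amounts per doctor, joining appointments, treatments and billing.
SELECT d.doctor_id,
       d.first_name,
       d.last_name,
       SUM(b.amount) AS total_billed,
       SUM(CASE WHEN b.payment_status = 'Paid' THEN b.amount ELSE 0 END) AS total_paid
FROM appointments AS a
JOIN doctors    AS d ON d.doctor_id      = a.doctor_id
JOIN treatments AS t ON t.appointment_id = a.appointment_id
JOIN billing    AS b ON b.treatment_id   = t.treatment_id
GROUP BY d.doctor_id, d.first_name, d.last_name
ORDER BY total_billed DESC;
```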
Possible Use Cases
SQL queries and relational database design
Exploratory data analysis (EDA) and dashboarding
Machine learning projects (e.g., cost prediction, no-show analysis)
Feature engineering and data cleaning practice
End-to-end healthcare analytics workflows
Recommended Tools & Resources
SQL (joins, filters, window functions)
Pandas and Matplotlib/Seaborn for EDA
Scikit-learn for ML models
Pandas Profiling for automated EDA
Plotly for interactive visualizations
Please note:
All data is synthetically generated for educational and project use. No real patient information is included.
If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
License: MIT License (https://opensource.org/licenses/MIT)
A complete operational database from a fictional Class 8 trucking company spanning three years. This isn't scraped web data or simplified tutorial content—it's a realistic simulation built from 12 years of real-world logistics experience, designed specifically for analysts transitioning into supply chain and transportation domains.
The dataset contains 85,000+ records across 14 interconnected tables covering everything from driver assignments and fuel purchases to maintenance schedules and delivery performance. Each table maintains proper foreign key relationships, making this ideal for practicing complex SQL queries, building data pipelines, or developing operational dashboards.
SQL Learners: Master window functions, CTEs, and multi-table JOINs using realistic business scenarios rather than contrived examples.
Data Analysts: Build portfolio projects that demonstrate understanding of operational metrics: cost-per-mile analysis, fleet utilization optimization, driver performance scorecards.
Aspiring Supply Chain Analysts: Work with authentic logistics data patterns—seasonal freight volumes, equipment utilization rates, route profitability calculations—without NDA restrictions.
Data Science Students: Develop predictive models for maintenance scheduling, driver retention, or route optimization using time-series data with actual business context.
Career Changers: If you're moving from operations into analytics (like the dataset creator), this provides a bridge—your domain knowledge becomes a competitive advantage rather than a gap to explain.
Most logistics datasets are either proprietary (unavailable) or overly simplified (unrealistic). This fills the gap: operational complexity without confidentiality concerns. The data reflects real industry patterns:
Core Entities (Reference Tables):
- Drivers (150 records): demographics, employment history, CDL info
- Trucks (120 records): fleet specs, acquisition dates, status
- Trailers (180 records): equipment types, current assignments
- Customers (200 records): shipper accounts, contract terms, revenue potential
- Facilities (50 records): terminals and warehouses with geocoordinates
- Routes (60+ records): city pairs with distances and rate structures

Operational Transactions:
- Loads (57,000+ records): shipment details, revenue, booking type
- Trips (57,000+ records): driver-truck assignments, actual performance
- Fuel Purchases (131,000+ records): transaction-level data with pricing
- Maintenance Records (6,500+ records): service history, costs, downtime
- Delivery Events (114,000+ records): pickup/delivery timestamps, detention
- Safety Incidents (114 records): accidents, violations, claims

Aggregated Analytics:
- Driver Monthly Metrics (5,400+ records): performance summaries
- Truck Utilization Metrics (3,800+ records): equipment efficiency
Temporal Coverage: January 2022 through December 2024 (3 years)
Geographic Scope: National operations across 25+ major US cities
Realistic Patterns:
- Seasonal freight fluctuations (Q4 peaks)
- Historical fuel price accuracy
- Equipment lifecycle modeling
- Driver retention dynamics
- Service level variations

Data Quality:
- Complete foreign key integrity
- No orphaned records
- Intentional 2% null rate in driver/truck assignments (reflects reality)
- All timestamps properly sequenced
- Financial calculations verified
Business Intelligence: Create executive dashboards showing revenue per truck, cost per mile, driver efficiency rankings, maintenance spend by equipment age, customer concentration risk.
Predictive Analytics: Build models forecasting equipment failures based on maintenance history, predict driver turnover using performance metrics, estimate route profitability for new lanes.
Operations Optimization: Analyze route efficiency, identify underutilized assets, optimize maintenance scheduling, calculate ideal fleet size, evaluate driver-to-truck ratios.
SQL Mastery: Practice window functions for running totals and rankings, write complex JOINs across 6+ tables, implement CTEs for hierarchical queries, perform cohort analysis on driver retention.
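A minimal sketch of the cost-per-mile style analysis described above, combining a CTE with a window function; the table and column names (trips, fuel_purchases, miles_driven, total_cost, trip_date, trip_id) are assumptions, so map them to the actual schema first.

```sql
-- Fuel cost per mile by truck and month, ranked within each month (PostgreSQL syntax).
-- Table and column names are assumptions; map them to the actual schema before running.
WITH monthly AS (
    SELECT t.truck_id,
           DATE_TRUNC('month', t.trip_date) AS trip_month,
           SUM(t.miles_driven)              AS miles,
           SUM(f.total_cost)                AS fuel_cost
    FROM trips          AS t
    JOIN fuel_purchases AS f ON f.trip_id = t.trip_id
    GROUP BY t.truck_id, DATE_TRUNC('month', t.trip_date)
)
SELECT truck_id,
       trip_month,
       fuel_cost / NULLIF(miles, 0) AS fuel_cost_per_mile,
       RANK() OVER (PARTITION BY trip_month
                    ORDER BY fuel_cost / NULLIF(miles, 0)) AS efficiency_rank
FROM monthly
ORDER BY trip_month, efficiency_rank;
```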
License: MIT License (https://opensource.org/licenses/MIT)
Dataset: cloud-training-demos.fintech
This dataset, hosted on BigQuery, is designed for financial technology (fintech) training and analysis. It comprises six interconnected tables, each providing detailed insights into various aspects of customer loans, loan purposes, and regional distributions. The dataset is ideal for practicing SQL queries, building data models, and conducting financial analytics.
customer:
Contains records of individual customers, including demographic details and unique customer IDs. This table serves as a primary reference for analyzing customer behavior and loan distribution.
loan:
Includes detailed information about each loan issued, such as the loan amount, interest rate, and tenure. The table is crucial for analyzing lending patterns and financial outcomes.
loan_count_by_year:
Provides aggregated loan data by year, offering insights into yearly lending trends. This table helps in understanding the temporal dynamics of loan issuance.
loan_purposes:
Lists various reasons or purposes for which loans were issued, along with corresponding loan counts. This data can be used to analyze customer needs and market demands.
loan_with_region:
Combines loan data with regional information, allowing for geographical analysis of lending activities. This table is key for regional market analysis and understanding how loan distribution varies across different areas.
state_region:
Maps state names to their respective regions, enabling a more granular geographical analysis when combined with other tables in the dataset.
For example, query the loan_count_by_year table to observe how lending patterns evolve over time. This dataset is ideal for those looking to enhance their skills in SQL, financial data analysis, and BigQuery, providing a comprehensive foundation for fintech-related projects and case studies.
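A minimal BigQuery sketch against this dataset; the column names (region, loan_amount) are assumptions, since only each table's role is described above.

```sql
-- Total loan volume by region (BigQuery Standard SQL; column names are assumed).
SELECT region,
       COUNT(*)         AS loans,
       SUM(loan_amount) AS total_loan_amount
FROM `cloud-training-demos.fintech.loan_with_region`
GROUP BY region
ORDER BY total_loan_amount DESC;
```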
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Mint Classics Company, a retailer of classic model cars and other vehicles, is looking at closing one of their storage facilities.
To support a data-based business decision, they are looking for suggestions and recommendations for reorganizing or reducing inventory, while still maintaining timely service to their customers. For example, they would like to be able to ship a product to a customer within 24 hours of the order being placed.
As a data analyst, you have been asked to use MySQL Workbench to familiarize yourself with the general business by examining the current data. You will be provided with a data model and sample data tables to review. You will then need to isolate and identify those parts of the data that could be useful in deciding how to reduce inventory. You will write queries to answer questions like these:
1) Where are items stored and if they were rearranged, could a warehouse be eliminated?
2) How are inventory numbers related to sales figures? Do the inventory counts seem appropriate for each item?
3) Are we storing items that are not moving? Are any items candidates for being dropped from the product line?
The answers to questions like those should help you to formulate suggestions and recommendations for reducing inventory with the goal of closing one of the storage facilities.
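A minimal sketch of the kind of query that helps answer question 3; the table and column names follow a typical classic-models style schema (products, orderdetails, productCode, quantityInStock) and should be verified against the provided data model.

```sql
-- Products in stock that have never been ordered (candidates for dropping).
-- Table and column names assume a classic-models style schema; verify against the data model.
SELECT p.productCode,
       p.productName,
       p.quantityInStock
FROM products AS p
LEFT JOIN orderdetails AS od ON od.productCode = p.productCode
WHERE od.productCode IS NULL
ORDER BY p.quantityInStock DESC;
```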
Project Objectives
Explore products currently in inventory.
Determine important factors that may influence inventory reorganization/reduction.
Provide analytic insights and data-driven recommendations.
Your Challenge
Your challenge will be to conduct an exploratory data analysis to investigate if there are any patterns or themes that may influence the reduction or reorganization of inventory in the Mint Classics storage facilities. To do this, you will import the database and then analyze data. You will also pose questions, and seek to answer them meaningfully using SQL queries to retrieve data from the database provided.
In this project, we'll use the fictional Mint Classics relational database and a relational data model. Both will be provided.
After you perform your analysis, you will share your findings.
[Dashboard preview image: Kimia_Farma_Dashboard.jpg]
This project analyzes Kimia Farma's performance from 2020 to 2023 using Google Looker Studio. The analysis is based on a pre-processed dataset stored in BigQuery, which serves as the data source for the dashboard.
The dashboard is designed to provide insights into branch performance, sales trends, customer ratings, and profitability. The development is ongoing, with multiple pages planned for a more in-depth analysis.
✅ The first page of the dashboard is completed
✅ A sample dashboard file is available on Kaggle
🔄 Development will continue with additional pages
The dataset consists of transaction records from Kimia Farma branches across different cities and provinces. Below are the key columns used in the analysis:
- transaction_id: Transaction ID code
- date: Transaction date
- branch_id: Kimia Farma branch ID code
- branch_name: Kimia Farma branch name
- kota: City of the Kimia Farma branch
- provinsi: Province of the Kimia Farma branch
- rating_cabang: Customer rating of the Kimia Farma branch
- customer_name: Name of the customer who made the transaction
- product_id: Product ID code
- product_name: Name of the medicine
- actual_price: Price of the medicine
- discount_percentage: Discount percentage applied to the medicine
- persentase_gross_laba: Gross profit percentage, based on the following conditions (see the SQL sketch after this column list):
Price ≤ Rp 50,000 → 10% profit
Price > Rp 50,000 - 100,000 → 15% profit
Price > Rp 100,000 - 300,000 → 20% profit
Price > Rp 300,000 - 500,000 → 25% profit
Price > Rp 500,000 → 30% profit
- nett_sales: Price after discount
- nett_profit: Profit earned by Kimia Farma
- rating_transaksi: Customer rating of the transaction
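A minimal sketch of how the gross-profit tiers, net sales, and net profit above can be derived in SQL; the source table name is an assumption, discount_percentage is assumed to be stored as a fraction (e.g., 0.10 for 10%), and nett_profit is assumed to be nett_sales times the gross-profit percentage.

```sql
-- Deriving persentase_gross_laba, nett_sales and nett_profit per transaction.
-- Source table name (kf_final_transaction) is an assumption; columns follow the list above.
SELECT transaction_id,
       actual_price,
       discount_percentage,
       CASE
           WHEN actual_price <= 50000  THEN 0.10
           WHEN actual_price <= 100000 THEN 0.15
           WHEN actual_price <= 300000 THEN 0.20
           WHEN actual_price <= 500000 THEN 0.25
           ELSE 0.30
       END AS persentase_gross_laba,
       actual_price * (1 - discount_percentage) AS nett_sales,
       actual_price * (1 - discount_percentage) *
       CASE
           WHEN actual_price <= 50000  THEN 0.10
           WHEN actual_price <= 100000 THEN 0.15
           WHEN actual_price <= 300000 THEN 0.20
           WHEN actual_price <= 500000 THEN 0.25
           ELSE 0.30
       END AS nett_profit
FROM kf_final_transaction;
```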
📌 kimia farma_query.txt – Contains SQL queries used for data analysis in Looker Studio
📌 kimia farma_analysis_table.csv – Preprocessed dataset ready for import and analysis
Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.
License: CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains a synthetic simulation of cloud resource usage and carbon emissions, designed for experimentation, analysis, and forecasting in sustainability and data engineering projects.
Included Tables:
- projects → Metadata about projects/teams.
- services → Metadata about cloud services (Compute, Storage, AI, etc.).
- emission_factors → Regional grid carbon intensity (gCO₂ per kWh).
- service_energy_coefficients → Conversion rates from usage units to kWh.
- daily_usage → Raw service usage (per project × service × region × day).
- daily_emissions → Carbon emissions derived from usage × regional emission factors.
- service_cost_coefficients → Conversion rates from usage units to cost (USD per unit).
- daily_cost_emissions → Integrated fact table combining usage, energy, cost, and emissions for analysis.
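A minimal sketch of how daily_emissions can be derived from the raw usage table; the measure column names (usage_amount, kwh_per_unit, gco2_per_kwh, usage_date) are assumptions, since only each table's role is listed above.

```sql
-- Estimated emissions per project, service, region and day:
-- usage * energy coefficient (kWh per unit) * regional carbon intensity (gCO2 per kWh).
-- Measure column names are assumed for illustration.
SELECT u.project_id,
       u.service_id,
       u.region,
       u.usage_date,
       u.usage_amount * c.kwh_per_unit                  AS energy_kwh,
       u.usage_amount * c.kwh_per_unit * e.gco2_per_kwh AS emissions_gco2
FROM daily_usage                 AS u
JOIN service_energy_coefficients AS c ON c.service_id = u.service_id
JOIN emission_factors            AS e ON e.region     = u.region
ORDER BY u.usage_date, u.project_id;
```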
Features:
- Simulated seasonality (weekend dips/spikes, holiday surges, quarter-end growth)
- Regional variations in carbon intensity (e.g., coal-heavy vs renewable grids)
- Multiple projects and services for multi-dimensional analysis
- Directly importable into BigQuery for analytics & forecasting

Use Cases:
- Explore sustainability analytics at scale
- Build carbon footprint dashboards
- Run AI/ML forecasting on emissions data
- Practice SQL, data modeling, and visualization
⚠️ Note: All data is synthetic and created for educational/demo purposes. It does not represent actual cloud provider emissions.
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
In the case study titled "Blinkit: Grocery Product Analysis," a dataset called 'Grocery Sales' contains 12 columns with information on sales of grocery items across different outlets. Using Tableau, you as a data analyst can uncover customer behavior insights, track sales trends, and gather feedback. These insights will drive operational improvements, enhance customer satisfaction, and optimize product offerings and store layout. Tableau enables data-driven decision-making for positive outcomes at Blinkit.
The table Grocery Sales is a .CSV file and has the following columns, details of which are as follows:
• Item_Identifier: A unique ID for each product in the dataset.
• Item_Weight: The weight of the product.
• Item_Fat_Content: Indicates whether the product is low fat or not.
• Item_Visibility: The percentage of the total display area in the store that is allocated to the specific product.
• Item_Type: The category or type of product.
• Item_MRP: The maximum retail price (list price) of the product.
• Outlet_Identifier: A unique ID for each store in the dataset.
• Outlet_Establishment_Year: The year in which the store was established.
• Outlet_Size: The size of the store in terms of ground area covered.
• Outlet_Location_Type: The type of city or region in which the store is located.
• Outlet_Type: Indicates whether the store is a grocery store or a supermarket.
• Item_Outlet_Sales: The sales of the product in the particular store. This is the outcome variable that we want to predict.
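Although the case study is Tableau-focused, the same questions can be answered in SQL once the CSV is loaded; a minimal sketch, assuming the table is named grocery_sales:

```sql
-- Average and total sales by outlet type and location type (table name grocery_sales is assumed).
SELECT Outlet_Type,
       Outlet_Location_Type,
       COUNT(*)               AS records,
       AVG(Item_Outlet_Sales) AS avg_item_sales,
       SUM(Item_Outlet_Sales) AS total_sales
FROM grocery_sales
GROUP BY Outlet_Type, Outlet_Location_Type
ORDER BY total_sales DESC;
```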
License: CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This is a synthetic dataset inspired by the merchandise and supply-chain operations of a Christian publishing company. It was created to practice:
Product & channel performance analysis
Supply-chain and vendor risk assessment
Inventory and backorder monitoring
Basic forecasting and scenario planning
The data spans 2025-01-01 to 2025-06-30 and includes 10 products (studies, devotionals, rosaries, journals, and digital bundles) sold across four channels (Website, Parish Bulk, Amazon, and Events) in four US regions (Northeast, Midwest, South, West).
Dataset summary
Rows: 5,435 daily product–channel–region records
Products: 10
Channels: Website, Parish Bulk, Amazon, Event
Regions: Northeast, Midwest, South, West
Vendors: Multiple printers and vendors with different lead times and risk profiles
Each row describes the performance of a single product on a given date in a given channel, along with inventory and vendor information that can be used for operational risk analysis.
Columns
date – Calendar date for the record (YYYY-MM-DD).
product_id – Short ID for the product (e.g., BIBLE-STUDY-101).
product_name – Human-readable product name (e.g., Foundations Bible Study).
product_category – High-level category (Adult Study, Seasonal, Sacrament Prep, etc.).
format – Physical or Digital format.
channel – Sales channel (Website, Parish Bulk, Amazon, Event).
region – US region where the sale occurred (Northeast, Midwest, South, West).
vendor – Primary printer or vendor responsible for fulfilling that product.
units_sold – Number of units sold for that product/date/channel/region.
unit_price – Selling price per unit (USD).
revenue – Total revenue = units_sold * unit_price.
cogs_per_unit – Cost of goods sold per unit (approximate production/fulfillment cost).
gross_margin – Revenue minus total COGS for that row.
inventory_start – On-hand inventory at the start of the day.
inventory_end – On-hand inventory at the end of the day after sales.
backorder_flag – True if demand exceeded inventory and created a backorder, otherwise False.
lead_time_days – Typical replenishment lead time in days for that product/vendor combination.
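Before the analyses listed below, here is a minimal sketch combining the columns above for product profitability and vendor risk; the table name (merch_sales) is an assumption, and backorder_flag is assumed to load as a boolean.

```sql
-- Gross margin and backorder rate by product and vendor, using the columns listed above.
-- Table name (merch_sales) is an assumption; backorder_flag is assumed to be boolean.
SELECT product_name,
       vendor,
       SUM(revenue)                                      AS total_revenue,
       SUM(gross_margin)                                 AS total_gross_margin,
       AVG(lead_time_days)                               AS avg_lead_time_days,
       AVG(CASE WHEN backorder_flag THEN 1.0 ELSE 0 END) AS backorder_rate
FROM merch_sales
GROUP BY product_name, vendor
ORDER BY total_gross_margin DESC;
```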
What you can do with this dataset
This dataset is designed for:
Product & channel profitability
Rank products by total profit or margin.
Compare profitability across channels and regions.
Supply-chain & vendor risk
Identify products with long lead times and frequent backorders.
Flag higher-risk vendors (e.g., long lead times, tight inventory).
Inventory analytics
Track when inventory gets tight.
Explore safety stock ideas using inventory_start, inventory_end, and backorder_flag.
Forecasting & scenario planning
Build time-series forecasts of units sold or revenue.
Simulate what happens if one vendor fails or lead times increase.
Learning & practice
Practice SQL, Python, or R data analysis.
Build dashboards (Tableau, Power BI, etc.) or case-study style projects for a product or data-analytics portfolio.
Important notes
This is not real Ascension data; it is fully synthetic and safe to use publicly.
The structure was designed to resemble realistic publishing/merchandise operations, but the exact numbers and patterns were generated programmatically.
If you use this dataset in a notebook, blog post, or portfolio project, feel free to link back here so others can see how you approached the analysis.
Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Analyses for this dataset could include time series, clustering, classification and more.
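For a time-series starting point, here is a minimal sketch assuming the data is loaded into a SQLite table named online_retail with the standard Online Retail columns (InvoiceNo, Quantity, UnitPrice, InvoiceDate in ISO format):

```sql
-- Monthly revenue and order counts (SQLite syntax; table name online_retail is assumed).
SELECT strftime('%Y-%m', InvoiceDate) AS invoice_month,
       COUNT(DISTINCT InvoiceNo)      AS orders,
       SUM(Quantity * UnitPrice)      AS revenue
FROM online_retail
GROUP BY strftime('%Y-%m', InvoiceDate)
ORDER BY invoice_month;
```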
License: Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
I am sharing this dataset for a basic SQL project on a railway management system, focused on the reservation area; the data provides basic details of reservation tickets. I collected the data from Wikipedia, GitHub, Kaggle, and other sources and built the project for basic understanding, with some moderate SQL queries, so it is also helpful for practicing SQL.