License: MIT (https://opensource.org/licenses/MIT)
This project focuses on analyzing the S&P 500 companies using data analysis tools like Python (Pandas), SQL, and Power BI. The goal is to extract insights related to sectors, industries, locations, and more, and visualize them using dashboards.
Included Files:
sp500_cleaned.csv – Cleaned dataset used for analysis
sp500_analysis.ipynb – Jupyter Notebook (Python + SQL code)
dashboard_screenshot.png – Screenshot of Power BI dashboard
README.md – Summary of the project and key takeaways
This project demonstrates practical data cleaning, querying, and visualization skills.
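As a flavor of the sector-level queries the notebook runs, here is a minimal hedged example, assuming the cleaned CSV has been loaded into a table named sp500 with Symbol and Sector columns (table and column names are assumptions):

```sql
-- Count S&P 500 constituents per sector (table and column names are assumed).
SELECT Sector,
       COUNT(Symbol) AS company_count
FROM sp500
GROUP BY Sector
ORDER BY company_count DESC;
```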
RSVP Movies is an Indian film production company that has produced many super-hit movies. They have usually released movies for the Indian audience, but for their next project they are planning to release a movie for a global audience in 2022.
The production company wants to plan every move analytically, based on data. We have taken the last three years of IMDb movie data and carried out the analysis using SQL, drawing meaningful insights that could help them start their new project.
For convenience, the entire analytics process has been divided into four segments, where each segment leads to significant insights from different combinations of tables. The questions in each segment, along with their business objectives, are written in the script below, with the solution code under every question.
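As an illustration of the kind of query the segments contain, here is a hedged sketch; the table and column names (movie, genre, ratings, date_published, avg_rating) are illustrative assumptions, not necessarily the exact schema used:

```sql
-- Average IMDb rating and movie count per genre for recently published titles.
SELECT g.genre,
       ROUND(AVG(r.avg_rating), 2) AS mean_rating,
       COUNT(*)                    AS movie_count
FROM movie   AS m
JOIN genre   AS g ON g.movie_id = m.id
JOIN ratings AS r ON r.movie_id = m.id
WHERE m.date_published >= '2019-01-01'
GROUP BY g.genre
ORDER BY mean_rating DESC;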
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This is a beginner-friendly SQLite database designed to help users practice SQL and relational database concepts. The dataset represents a basic business model inspired by NVIDIA and includes interconnected tables covering essential aspects like products, customers, sales, suppliers, employees, and projects. It's perfect for anyone new to SQL or data analytics who wants to learn and experiment with structured data.
Products: Includes details of 15 products (e.g., GPUs, AI accelerators). Attributes: product_id, product_name, category, release_date, price.
Customers: Lists 20 fictional customers with their industry and contact information. Attributes: customer_id, customer_name, industry, contact_email, contact_phone.
Sales: Contains 100 sales records tied to products and customers. Attributes: sale_id, product_id, customer_id, sale_date, region, quantity_sold, revenue.
Suppliers: Features 50 suppliers and the materials they provide. Attributes: supplier_id, supplier_name, material_supplied, contact_email.
Supply Chain: Tracks materials supplied to produce products, proportional to sales. Attributes: supply_chain_id, supplier_id, product_id, supply_date, quantity_supplied.
Departments: Lists 5 departments within the business. Attributes: department_id, department_name, location.
Employees: Contains data on 30 employees and their roles in different departments. Attributes: employee_id, first_name, last_name, department_id, hire_date, salary.
Projects: Describes 10 projects handled by different departments. Attributes: project_id, project_name, department_id, start_date, end_date, budget.
Number of Tables: 8. Total Rows: around 230 across all tables, ensuring quick queries and easy exploration.
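A simple practice query against this schema might look like the following sketch; the table names (products, sales) are assumed from the descriptions above, and the column names match the listed attributes:

```sql
-- Total units sold and revenue per product category.
SELECT p.category,
       SUM(s.quantity_sold) AS units_sold,
       SUM(s.revenue)       AS total_revenue
FROM sales    AS s
JOIN products AS p ON p.product_id = s.product_id
GROUP BY p.category
ORDER BY total_revenue DESC;
```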
SQL Case Study Project: Employee Database Analysis 📊
I recently completed a comprehensive SQL project involving a simulated employee database with multiple tables.
In this project, I practiced and applied a wide range of SQL concepts:
✅ Simple Queries ✅ Filtering with WHERE conditions ✅ Sorting with ORDER BY ✅ Aggregation using GROUP BY and HAVING ✅ Multi-table JOINs ✅ Conditional Logic using CASE ✅ Subqueries and Set Operators
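To give a flavor of these concepts, here is a hedged sketch (not taken from the repository) that combines a JOIN, CASE logic, GROUP BY, and HAVING against hypothetical employees and departments tables:

```sql
-- Classify employees into salary bands per department (illustrative schema).
SELECT d.department_name,
       CASE WHEN e.salary >= 100000 THEN 'Senior band'
            WHEN e.salary >= 60000  THEN 'Mid band'
            ELSE 'Entry band' END AS salary_band,
       COUNT(*) AS employee_count
FROM employees   AS e
JOIN departments AS d ON d.department_id = e.department_id
GROUP BY d.department_name,
         CASE WHEN e.salary >= 100000 THEN 'Senior band'
              WHEN e.salary >= 60000  THEN 'Mid band'
              ELSE 'Entry band' END
HAVING COUNT(*) > 1
ORDER BY d.department_name, employee_count DESC;
```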
💡 Key Highlights:
🛠️ Tools Used: Azure Data Studio
📂 You can find the entire project and scripts here:
👉 https://github.com/RiddhiNDivecha/Employee-Database-Analysis
This project helped me sharpen my SQL skills and understand business logic more deeply in a practical context.
💬 I’m open to feedback and happy to connect with fellow data enthusiasts!
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
The dataset used in this project is inspired by the HR Analytics: Job Change of Data Scientists dataset available on Kaggle. It contains information about candidates’ demographics, education, work experience, company details, and training hours, aiming to predict whether a candidate is likely to seek a new job. This simulation recreates the structure of the original dataset in a lightweight SQLite environment to demonstrate SQL operations in Python. It provides an ideal context for learning and practicing essential SQL commands such as CREATE, INSERT, SELECT, JOIN, and more, using realistic HR data scenarios.
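A short SQLite-flavored sketch of the commands practiced here is shown below; the candidates table and its columns are simplified assumptions, not the exact schema of the Kaggle dataset:

```sql
-- CREATE, INSERT, and SELECT on a simplified HR table (illustrative schema).
CREATE TABLE IF NOT EXISTS candidates (
    candidate_id     INTEGER PRIMARY KEY,
    city             TEXT,
    education_level  TEXT,
    experience_years INTEGER,
    training_hours   INTEGER,
    seeking_new_job  INTEGER  -- 1 = likely to change jobs, 0 = not
);

INSERT INTO candidates (candidate_id, city, education_level, experience_years, training_hours, seeking_new_job)
VALUES (1, 'city_103', 'Graduate', 5, 36, 1);

SELECT education_level,
       AVG(training_hours)  AS avg_training_hours,
       SUM(seeking_new_job) AS likely_switchers
FROM candidates
GROUP BY education_level;
```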
Title: Practical Exploration of SQL Constraints: Building a Foundation in Data Integrity
Introduction: Welcome to my data analysis project, where I focus on mastering SQL constraints, a pivotal aspect of database management. This project centers on hands-on experience with SQL's Data Definition Language (DDL) commands, emphasizing constraints such as PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, and DEFAULT. It demonstrates my foundational understanding of enforcing data integrity and maintaining a structured database environment.
Purpose: The primary purpose of this project is to showcase my proficiency in implementing and managing SQL constraints for robust data governance. The exercises give insight into my SQL skills and how I use constraints to ensure data accuracy, consistency, and reliability within relational databases.
What to Expect: The project contains a series of exercises on the implementation and use of the following key constraint types:
NOT NULL: Ensuring the presence of essential data in a column.
PRIMARY KEY: Ensuring unique identification of records for data integrity.
FOREIGN KEY: Establishing relationships between tables to maintain referential integrity.
UNIQUE: Guaranteeing the uniqueness of values within specified columns.
CHECK: Implementing custom conditions to validate data entries.
DEFAULT: Setting default values for columns to enhance data reliability.
Each exercise is accompanied by clear and concise SQL scripts, explanations of the intended outcomes, and practical insights into the application of these constraints. Together, the exercises show how SQL constraints serve as crucial tools for creating a structured and dependable database foundation, upholding data quality, and supporting informed decision-making in data analysis.
3.1 CONSTRAINT - ENFORCING NOT NULL CONSTRAINT WHILE CREATING A NEW TABLE.
3.2 CONSTRAINT - ENFORCING NOT NULL CONSTRAINT ON AN EXISTING COLUMN.
3.3 CONSTRAINT - ENFORCING PRIMARY KEY CONSTRAINT WHILE CREATING A NEW TABLE.
3.4 CONSTRAINT - ENFORCING PRIMARY KEY CONSTRAINT ON AN EXISTING COLUMN.
3.5 CONSTRAINT - ENFORCING FOREIGN KEY CONSTRAINT WHILE CREATING A NEW TABLE.
3.6 CONSTRAINT - ENFORCING FOREIGN KEY CONSTRAINT ON AN EXISTING COLUMN.
3.7 CONSTRAINT - ENFORCING UNIQUE CONSTRAINT WHILE CREATING A NEW TABLE.
3.8 CONSTRAINT - ENFORCING UNIQUE CONSTRAINT IN AN EXISTING TABLE.
3.9 CONSTRAINT - ENFORCING CHECK CONSTRAINT IN A NEW TABLE.
3.10 CONSTRAINT - ENFORCING CHECK CONSTRAINT IN AN EXISTING TABLE.
3.11 CONSTRAINT - ENFORCING DEFAULT CONSTRAINT IN A NEW TABLE.
3.12 CONSTRAINT - ENFORCING DEFAULT CONSTRAINT IN AN EXISTING TABLE.
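To make the outline above concrete, here is a short hedged sketch of the same constraint types applied to an invented two-table schema (not the tables used in the project scripts); exact ALTER TABLE syntax varies slightly between database engines:

```sql
-- Constraint types 3.1, 3.3, 3.5, 3.7, 3.9 and 3.11 in a single CREATE TABLE.
CREATE TABLE departments (
    dept_id   INT PRIMARY KEY,
    dept_name VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE employees (
    emp_id   INT PRIMARY KEY,                      -- 3.3 PRIMARY KEY
    emp_name VARCHAR(100) NOT NULL,                -- 3.1 NOT NULL
    email    VARCHAR(100) UNIQUE,                  -- 3.7 UNIQUE
    salary   DECIMAL(10,2) CHECK (salary > 0),     -- 3.9 CHECK
    status   VARCHAR(20) DEFAULT 'ACTIVE',         -- 3.11 DEFAULT
    dept_id  INT,
    CONSTRAINT fk_emp_dept
        FOREIGN KEY (dept_id) REFERENCES departments (dept_id)  -- 3.5 FOREIGN KEY
);

-- The "existing table" variants (3.2, 3.4, 3.6, 3.8, 3.10, 3.12) follow the same
-- ALTER TABLE pattern.
ALTER TABLE employees
    ADD CONSTRAINT chk_salary_cap CHECK (salary <= 1000000);
```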
🎟️ BookMyShow SQL Data Analysis
🎯 Objective
This project leverages SQL-based analysis to gain actionable insights into user engagement, movie performance, theater efficiency, payment systems, and customer satisfaction on the BookMyShow platform. The goal is to enhance platform performance, boost revenue, and optimize user experience through data-driven strategies.
📊 Key Analysis Areas
1. 👥 User Behavior & Engagement
- Identify most active users and repeat customers
- Track unique monthly users
- Analyze peak booking times and average tickets per user
- Drive engagement strategies and boost customer retention
2. 🎬 Movie Performance Analysis
- Highlight top-rated and most booked movies
- Analyze popular languages and high-revenue genres
- Study average occupancy rates
- Focus marketing on high-performing genres and content
3. 🏢 Theater & Show Performance
- Pinpoint theaters with highest/lowest bookings
- Evaluate popular show timings
- Measure theater-wise revenue contribution and occupancy
- Improve theater scheduling and resource allocation
4. 💵 Booking & Revenue Insights
- Track total revenue, top spenders, and monthly booking patterns
- Discover most used payment methods
- Calculate average price per booking and bookings per user
- Optimize revenue generation and spending strategies
5. 🪑 Seat Utilization & Pricing Strategy
- Identify most booked seat types and their revenue impact
- Analyze seat pricing variations and price elasticity
- Align pricing strategy with demand patterns for higher revenue
6. ✅❌ Payment & Transaction Analysis
- Distinguish successful vs. failed transactions
- Track refund frequency and payment delays
- Evaluate revenue lost due to failures
- Enhance payment processing systems
7. ⭐ User Reviews & Sentiment Analysis
- Measure average ratings per movie
- Identify top and lowest-rated content
- Analyze review volume and sentiment trends
- Leverage feedback to refine content offerings
🧰 Tech Stack
- Query Language: SQL (MySQL/PostgreSQL)
- Database Tools: DBeaver, pgAdmin, or any SQL IDE
- Visualization (Optional): Power BI / Tableau for presenting insights
- Version Control: Git & GitHub
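A hedged sketch for analysis area 2 (most booked and highest-grossing movies) is shown below; the table and column names (bookings, movies, ticket_count, total_amount) are assumptions for illustration, and the actual schema may differ:

```sql
-- Top 10 movies by gross revenue, with booking and ticket counts.
SELECT m.movie_title,
       COUNT(b.booking_id) AS total_bookings,
       SUM(b.ticket_count) AS tickets_sold,
       SUM(b.total_amount) AS gross_revenue
FROM bookings AS b
JOIN movies   AS m ON m.movie_id = b.movie_id
GROUP BY m.movie_title
ORDER BY gross_revenue DESC
LIMIT 10;
```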
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
📖 Dataset Description
This dataset provides an end-to-end view of vendor performance across multiple dimensions — purchases, sales, inventory, pricing, and invoices. It is designed for data analytics, visualization, and business intelligence projects, making it ideal for learners and professionals exploring procurement, vendor management, and supply chain optimization.
🔗 GitHub Project (Code + Power BI Dashboard): Vendor Performance Analysis (https://github.com/HARSH-MADHAVAN/Vendor-Performance-Analysis)
The dataset includes:
purchases.csv → Detailed vendor purchase transactions
sales.csv → Sales performance data linked to vendors
inventory.csv (begin & end) → Stock levels at different periods
purchase_prices.csv → Historical vendor pricing
vendor_invoice.csv → Invoice details for reconciliation
vendor_sales_summary.csv → Aggregated vendor-wise sales insights
Use this dataset to practice:
SQL querying & data modeling
Python analytics & preprocessing
Power BI dashboarding & reporting
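For example, a vendor-level purchase vs. sales comparison might look like the hedged sketch below; it assumes the CSVs are loaded as tables named purchases and sales with vendor_id, purchase_amount, and sales_amount columns (all column names are assumptions):

```sql
-- Compare total purchase spend and sales revenue per vendor using CTEs,
-- aggregating each side first to avoid double counting on the join.
WITH purchase_totals AS (
    SELECT vendor_id, SUM(purchase_amount) AS total_purchases
    FROM purchases
    GROUP BY vendor_id
),
sales_totals AS (
    SELECT vendor_id, SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY vendor_id
)
SELECT p.vendor_id,
       p.total_purchases,
       s.total_sales,
       s.total_sales - p.total_purchases AS gross_margin
FROM purchase_totals AS p
JOIN sales_totals    AS s ON s.vendor_id = p.vendor_id
ORDER BY gross_margin DESC;
```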
License: MIT (https://opensource.org/licenses/MIT)
Dataset: cloud-training-demos.fintech
This dataset, hosted on BigQuery, is designed for financial technology (fintech) training and analysis. It comprises six interconnected tables, each providing detailed insights into various aspects of customer loans, loan purposes, and regional distributions. The dataset is ideal for practicing SQL queries, building data models, and conducting financial analytics.
customer:
Contains records of individual customers, including demographic details and unique customer IDs. This table serves as a primary reference for analyzing customer behavior and loan distribution.
loan:
Includes detailed information about each loan issued, such as the loan amount, interest rate, and tenure. The table is crucial for analyzing lending patterns and financial outcomes.
loan_count_by_year:
Provides aggregated loan data by year, offering insights into yearly lending trends. This table helps in understanding the temporal dynamics of loan issuance.
loan_purposes:
Lists various reasons or purposes for which loans were issued, along with corresponding loan counts. This data can be used to analyze customer needs and market demands.
loan_with_region:
Combines loan data with regional information, allowing for geographical analysis of lending activities. This table is key for regional market analysis and understanding how loan distribution varies across different areas.
state_region:
Maps state names to their respective regions, enabling a more granular geographical analysis when combined with other tables in the dataset.
For example, you can query the loan_count_by_year table to observe how lending patterns evolve over time. This dataset is ideal for those looking to enhance their skills in SQL, financial data analysis, and BigQuery, providing a comprehensive foundation for fintech-related projects and case studies.
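A regional rollup might look like the following BigQuery sketch; the loan_with_region table name comes from the description above, while the region and loan_amount column names are assumptions:

```sql
-- Loan volume and average loan size by region.
SELECT region,
       COUNT(*)                   AS loan_count,
       ROUND(AVG(loan_amount), 2) AS avg_loan_amount
FROM `cloud-training-demos.fintech.loan_with_region`
GROUP BY region
ORDER BY loan_count DESC;
```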
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.
I am not affiliated or partnered with Kensho in any way; I just really like this dataset because it is easy for my agents to query.
Key Features:
Contains over 5 million rows of data from English Wikipedia and Wikidata
Stored in a portable SQLite database format for easy integration and querying
Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
Ideal for NLP tasks, machine learning, data analysis, and research projects
The database consists of four main tables:
This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license.
Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes.
By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.
https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data
Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings
Usage with LIKE queries:
```
import asyncio
import aiosqlite


class KenshoDatasetQuery:
    """Async context manager for running LIKE searches against the SQLite DB."""

    def __init__(self, db_file):
        self.db_file = db_file

    async def __aenter__(self):
        self.conn = await aiosqlite.connect(self.db_file)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.conn.close()

    async def search_pages_by_title(self, title):
        # Join pages to their Wikidata items and link-annotated text sections.
        query = """
            SELECT pages.page_id, pages.item_id, pages.title, pages.views,
                   items.labels AS item_labels, items.description AS item_description,
                   link_annotated_text.sections
            FROM pages
            JOIN items ON pages.item_id = items.id
            JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
            WHERE pages.title LIKE ?
        """
        async with self.conn.execute(query, (f"%{title}%",)) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label_or_description(self, keyword):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label(self, label):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ?
        """
        async with self.conn.execute(query, (f"%{label}%",)) as cursor:
            return await cursor.fetchall()

    # async def search_properties_by_label_or_desc... (truncated in the original listing)
```
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This project was a powerful introduction to the practical application of database design and SQL in a real-world scenario. It helped me understand how a well-structured relational database supports business scalability and data integrity — especially for businesses transitioning from flat files like spreadsheets to a more robust system.
One key takeaway for me was the importance of normalizing data, not just to reduce redundancy but to ensure that information is easily queryable and future-proof. Working with MySQL Workbench also gave me hands-on experience in visual database modeling, which made the conceptual relationships between tables much clearer.
While I encountered a few challenges setting up MySQL Workbench and configuring the database connections, overcoming those technical steps gave me more confidence in managing development tools — a crucial skill for both data analysts and back-end developers.
If I were to extend this project in the future, I would consider:
Adding tables for inventory management, supplier information, or delivery tracking
Building simple data dashboards to visualize sales and product performance
Automating the data import process from CSV to SQL
Overall, this project bridged the gap between theory and practical application. It deepened my understanding of how structured data can unlock powerful insights and better decision-making for businesses.
The Practical Exercise in SQL Data Definition Language (DDL) Commands is a hands-on project designed to help you gain a deep understanding of fundamental DDL commands in SQL, including the operations listed below.
This project aims to enhance your proficiency in using SQL to create, modify, and manage database structures effectively.
1.1 DDL - CREATE TABLE
1.2 DDL - ALTER TABLE (ADD COLUMN)
1.3 DDL - ALTER TABLE (RENAME COLUMN)
1.4 DDL - ALTER TABLE (RENAME TABLE)
1.5 DDL - ALTER TABLE (DROP COLUMN)
1.6 DDL - DROP TABLE
1.7 DDL - TRUNCATE TABLE
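A hedged walk-through of these commands on a made-up table is sketched below; exact ALTER TABLE syntax varies slightly by engine (for example, some engines omit the COLUMN keyword):

```sql
CREATE TABLE staff (
    staff_id  INT,
    full_name VARCHAR(100)
);                                                        -- 1.1 create the table

ALTER TABLE staff ADD COLUMN hire_date DATE;              -- 1.2 add a column
ALTER TABLE staff RENAME COLUMN full_name TO staff_name;  -- 1.3 rename a column
ALTER TABLE staff RENAME TO employees;                    -- 1.4 rename the table
ALTER TABLE employees DROP COLUMN hire_date;              -- 1.5 drop a column
TRUNCATE TABLE employees;                                 -- 1.7 remove all rows, keep structure
DROP TABLE employees;                                     -- 1.6 drop the table entirely
```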
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Comprehensive Amazon India sales dataset featuring 15,000 synthetic e-commerce transactions from 2025. This cleaned and validated dataset captures real-world shopping patterns including customer behavior, product preferences, payment methods, delivery metrics, and regional sales distribution across Indian states.
Key Features:
- 15,000 orders across multiple product categories (Electronics, Clothing, Home & Kitchen, Beauty)
- Daily transactional data from January to December 2025
- Complete customer journey: order placement, payment, delivery, and review
- Geographic coverage across major Indian states
- Payment method diversity: Credit Card, Debit Card, UPI, Cash on Delivery
- Delivery status tracking: Delivered, Pending, Returned
- Customer review ratings and sentiment analysis
Dataset Columns (14): Order_ID, Date, Customer_ID, Product_Category, Product_Name, Quantity, Unit_Price_INR, Total_Sales_INR, Payment_Method, Delivery_Status, Review_Rating, Review_Text, State, Country
Use Cases:
- E-commerce sales analysis and forecasting
- Customer behavior and segmentation studies
- Payment method preference analysis
- Regional market trends and geographic insights
- Delivery optimization and logistics planning
- Product performance and category analysis
- Customer satisfaction and review analysis
- SQL practice and business intelligence training
Data Quality:
- Cleaned and validated for analysis
- No missing values in critical fields
- Consistent data types and formatting
- Ready for immediate SQL/Python analysis
Perfect for data analysts, SQL learners, business intelligence projects, and e-commerce analytics practice!
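A typical BI-style query on this data might look like the following sketch; the table name amazon_sales is an assumption, the column names come from the dataset description above, and strftime() is SQLite syntax (use your engine's equivalent, such as DATE_FORMAT in MySQL):

```sql
-- Monthly revenue and order count per product category.
SELECT strftime('%Y-%m', Date)   AS order_month,
       Product_Category,
       SUM(Total_Sales_INR)      AS revenue_inr,
       COUNT(DISTINCT Order_ID)  AS orders
FROM amazon_sales
GROUP BY order_month, Product_Category
ORDER BY order_month, revenue_inr DESC;
```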
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
This dataset is a synthetic yet realistic E-commerce retail dataset generated programmatically using Python (Faker + NumPy + Pandas).
It is designed to closely mimic real-world online shopping behavior, user patterns, product interactions, seasonal trends, and marketplace events.
Machine Learning & Deep Learning
Recommender Systems
Customer Segmentation
Sales Forecasting
A/B Testing
E-commerce Behaviour Analysis
Data Cleaning / Feature Engineering Practice
SQL practice
The dataset contains 6 CSV files:
~~~
File              Rows      Description
users.csv         ~10,000   User profiles, demographics & signup info
products.csv      ~2,000    Product catalog with rating and pricing
orders.csv        ~20,000   Order-level transactions
order_items.csv   ~60,000   Items purchased per order
reviews.csv       ~15,000   Customer-written product reviews
events.csv        ~80,000   User event logs: view, cart, wishlist, purchase
~~~
1. Users (users.csv)
- user_id: Unique user identifier
- name: Full customer name
- email: Email (synthetic, no real emails)
- gender: Male / Female / Other
- city: City of residence
- signup_date: Account creation date
2. Products (products.csv)
- product_id: Unique product identifier
- product_name: Product title
- category: Electronics, Clothing, Beauty, Home, Sports, etc.
- price: Actual selling price
- rating: Average product rating
3. Orders (orders.csv)
- order_id: Unique order identifier
- user_id: User who placed the order
- order_date: Timestamp of the order
- order_status: Completed / Cancelled / Returned
- total_amount: Total order value
4. Order Items (order_items.csv)
- order_item_id: Unique identifier
- order_id: Associated order
- product_id: Purchased product
- quantity: Quantity purchased
- item_price: Price per unit
5. Reviews (reviews.csv)
- review_id: Unique review identifier
- user_id: User who submitted the review
- product_id: Reviewed product
- rating: 1–5 star rating
- review_text: Short synthetic review
- review_date: Submission date
6. Events (events.csv)
- event_id: Unique event identifier
- user_id: User performing the event
- product_id: Viewed/added/purchased product
- event_type: view / cart / wishlist / purchase
- event_timestamp: Timestamp of the event
Example analytics and modeling use cases:
Customer churn prediction
Review sentiment analysis (NLP)
Recommendation engines
Price optimization models
Demand forecasting (Time-series)
Market basket analysis
RFM segmentation
Cohort analysis
Funnel conversion tracking
A/B testing simulations
SQL practice topics (see the sketch after this list):
Joins
Window functions
Aggregations
CTE-based funnels
Complex queries
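As a small example of a CTE-based funnel on the events table described above (view to cart to purchase), counting distinct users who reach each stage:

```sql
-- Funnel stages with the share of viewers reaching each stage,
-- combining a CTE, aggregation, and a window function.
WITH stage_users AS (
    SELECT event_type,
           COUNT(DISTINCT user_id) AS users_in_stage
    FROM events
    WHERE event_type IN ('view', 'cart', 'purchase')
    GROUP BY event_type
)
SELECT event_type,
       users_in_stage,
       ROUND(100.0 * users_in_stage /
             MAX(CASE WHEN event_type = 'view' THEN users_in_stage END) OVER (), 2)
           AS pct_of_viewers
FROM stage_users
ORDER BY users_in_stage DESC;
```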
Generated using:
Faker for realistic user and review generation
NumPy for probability-based event modeling
Pandas for data processing
Simulated data characteristics include:
demand variation
user behavior simulation
return/cancel probabilities
seasonal order timestamp distribution
The dataset does not include any real personal data.
Everything is generated synthetically.
This dataset is released under CC BY 4.0 — free to use for:
Research
Education
Commercial projects
Kaggle competitions
Machine learning pipelines
Just provide attribution.
Upvote the dataset
Leave a comment
Share your notebooks using it
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains comprehensive synthetic healthcare data designed for fraud detection analysis. It includes information on patients, healthcare providers, insurance claims, and payments. The dataset is structured to mimic real-world healthcare transactions, where fraudulent activities such as false claims, overbilling, and duplicate charges can be identified through advanced analytics.
The dataset is suitable for practicing SQL queries, exploratory data analysis (EDA), machine learning for fraud detection, and visualization techniques. It is designed to help data analysts and data scientists develop and refine their analytical skills in the healthcare insurance domain.
Dataset Overview
The dataset consists of four CSV files:
Patients Data (patients.csv)
Contains demographic details of patients, such as age, gender, insurance type, and location. Can be used to analyze patient demographics and healthcare usage patterns.
Providers Data (providers.csv)
Contains information about healthcare providers, including provider ID, specialty, location, and associated hospital. Useful for identifying fraudulent claims linked to specific providers or hospitals.
Claims Data (claims.csv)
Contains records of insurance claims made by patients, including diagnosis codes, treatment details, provider ID, and claim amount. Can be analyzed for suspicious patterns, such as excessive claims from a single provider or duplicate claims for the same patient.
Payments Data (payments.csv)
Contains details of claim payments made by insurance companies, including payment amount, claim ID, and reimbursement status. Helps in detecting discrepancies between claims and actual reimbursements.
Possible Analysis Ideas
This dataset allows for multiple analysis approaches, including but not limited to:
🔹 Fraud Detection: Identify patterns in claims data to detect fraudulent activities (e.g., excessive billing, duplicate claims).
🔹 Provider Behavior Analysis: Analyze providers who have an unusually high claim volume or high rejection rates.
🔹 Payment Trends: Compare claims vs. payments to find irregularities in reimbursement patterns.
🔹 Patient Demographics & Utilization: Explore which patient groups are more likely to file claims and receive reimbursements.
🔹 SQL Query Practice: Perform advanced SQL queries, including joins, aggregations, window functions, and subqueries, to extract insights from the data.
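A hedged fraud-detection sketch on the claims file is shown below; column names such as patient_id, provider_id, diagnosis_code, claim_amount, and claim_date are assumptions based on the descriptions above:

```sql
-- Flag potential duplicate claims: same patient, provider, diagnosis,
-- amount, and date appearing more than once.
SELECT patient_id,
       provider_id,
       diagnosis_code,
       claim_amount,
       claim_date,
       COUNT(*) AS duplicate_count
FROM claims
GROUP BY patient_id, provider_id, diagnosis_code, claim_amount, claim_date
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;
```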
Use Cases
Practicing SQL queries for job interviews and real-world projects.
Learning data cleaning, data wrangling, and feature engineering for healthcare analytics.
Applying machine learning techniques for fraud detection.
Gaining insights into the healthcare insurance domain and its challenges.
License & Usage
License: CC0 Public Domain (free to use for any purpose).
Attribution: Not required but appreciated.
Intended Use: This dataset is for educational and research purposes only.
This dataset is an excellent resource for aspiring data analysts, data scientists, and SQL learners who want to gain hands-on experience in healthcare fraud detection.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
This portfolio highlights both practice projects and contributions I've made on the job, with a focus on practical, results-driven analysis. Each project reflects my ability to solve business problems using tools like Excel for data visualization, SQL for querying and structuring data, and the skills I've built in Python.
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
Complete data engineering project on 4 years (2014-2017) of retail sales transactions.
DATASET CONTENTS:
- Original denormalized data (9,994 rows)
- Normalized database: 4 tables (customers, orders, products, sales)
- 9 SQL analysis files organized by phase
- Complete EDA from data cleaning to business insights
DATABASE TABLES:
- customers: 793 records
- orders: 4,931 records
- products: 1,812 records
- sales: 9,686 transactions
KEY FINDINGS:
- Low profitability: 12.44% margin (below industry standard)
- Discount problem: 50%+ of transactions have 20%+ discounts
- Loss-making: 18.66% of transactions lose money
- Furniture crisis: only 2.31% margin
- Small baskets: only 1.96 items per order
SQL SKILLS DEMONSTRATED:
✓ Window functions (ROW_NUMBER, PARTITION BY)
✓ Database normalization (3NF)
✓ Complex JOINs (3-4 tables)
✓ Data deduplication with CTEs
✓ Business analytics queries
✓ CASE statements and aggregations
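A hedged sketch of the deduplication pattern listed above, keeping one row per order line with ROW_NUMBER() in a CTE; the column names are assumptions based on a typical Superstore-style schema, not the project's exact files:

```sql
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id, product_id
               ORDER BY sale_amount DESC
           ) AS rn
    FROM sales
)
SELECT *
FROM ranked
WHERE rn = 1;  -- keep a single row per (order, product) pair
```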
PERFECT FOR:
- SQL practice (beginner to advanced)
- Database normalization learning
- EDA methodology study
- Business analytics projects
- Data engineering portfolios
FILES INCLUDED:
- 5 CSV files (original + 4 normalized tables)
- 9 SQL query files (cleaning, migration, analysis)
Author: Nawaf Alzzeer
License: CC BY-SA 4.0
License: MIT (https://opensource.org/licenses/MIT)
This dataset provides a comprehensive view of retail operations, combining sales transactions, return records, and shipping cost details into one analysis-ready package. It’s ideal for data analysts, business intelligence professionals, and students looking to practice Power BI, Tableau, or SQL projects focusing on sales performance, profitability, and operational cost analysis.
Dataset Structure
Orders Table – Detailed transactional data
Row ID
Order ID
Order Date, Ship Date, Delivery Duration
Ship Mode
Customer ID, Customer Name, Segment, Country, City, State, Postal Code, Region
Product ID, Category, Sub-Category, Product Name
Sales, Quantity, Discount, Discount Value, Profit, COGS
Returns Table – Return records by Order ID
Returned (Yes/No)
Order ID
Shipping Cost Table – State-level shipping expenses
State
Shipping Cost Per Unit
Potential Use Cases
Calculate gross vs. net profit after considering returns and shipping costs (see the sketch after this list).
Perform regional sales and profit analysis.
Identify high-return products and loss-making categories.
Visualize KPIs in Power BI or Tableau.
Build predictive models for returns or shipping costs.
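For the first use case above, a hedged sketch might look like the following; the table names (orders, returns, shipping_cost) and the underscore column names are assumptions based on the structure described earlier:

```sql
-- State-level gross vs. net profit after returned-order losses and
-- per-unit shipping costs.
SELECT o.State,
       SUM(o.Profit)                                                AS gross_profit,
       SUM(CASE WHEN r.Returned = 'Yes' THEN o.Profit ELSE 0 END)   AS profit_lost_to_returns,
       SUM(o.Quantity * COALESCE(sc.Shipping_Cost_Per_Unit, 0))     AS shipping_cost,
       SUM(o.Profit)
         - SUM(CASE WHEN r.Returned = 'Yes' THEN o.Profit ELSE 0 END)
         - SUM(o.Quantity * COALESCE(sc.Shipping_Cost_Per_Unit, 0)) AS net_profit
FROM orders AS o
LEFT JOIN returns       AS r  ON r.Order_ID = o.Order_ID
LEFT JOIN shipping_cost AS sc ON sc.State = o.State
GROUP BY o.State
ORDER BY net_profit;
```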
Source & Context: The dataset is designed for educational and analytical purposes. It is inspired by retail and e-commerce operations data and was prepared for data analytics portfolio projects.
License: Open for use in learning, analytics projects, and data visualization practice.
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Cyclistic Bike-Share Dataset (2022–2024) – Cleaned & Merged
This dataset contains three full years (2022, 2023, and 2024) of publicly available Cyclistic bike-share trip data. All yearly files have been cleaned, standardized, and merged into a single high-quality master dataset for easy analysis.
The dataset is ideal for the use cases listed under "What You Can Analyze" below.
🔹 Key Cleaning & Processing Steps
- Removed duplicate records
- Handled missing values
- Standardized column names
- Converted date-time formats
- Created calculated columns (ride length, day, month, etc.)
- Merged yearly datasets into one master CSV file (3.17 GB)
🔹 What You Can Analyze
- Member vs Casual rider behavior
- Peak riding hours and days
- Monthly & seasonal trends
- Trip duration patterns
- Station usage & demand forecasting
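A hedged sketch of the member vs. casual comparison is shown below; the table name (cyclistic_trips) and column names (member_casual, day_of_week, ride_length_minutes) are assumptions based on the calculated columns described above:

```sql
-- Ride count and average ride length by rider type and day of week.
SELECT member_casual,
       day_of_week,
       COUNT(*)                           AS rides,
       ROUND(AVG(ride_length_minutes), 1) AS avg_ride_length_min
FROM cyclistic_trips
GROUP BY member_casual, day_of_week
ORDER BY member_casual, rides DESC;
```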
This dataset is especially useful for data analyst portfolio projects and technical interview preparation.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
A realistic synthetic French insurance dataset specifically designed for practicing data cleaning, transformation, and analytics with PySpark and other big data tools. This dataset contains intentional data quality issues commonly found in real-world insurance data.
Perfect for practicing data cleaning and transformation:
Examples of the intentional quality issues and the PySpark functions they exercise:
- Mixed date formats: 2024-01-15, 15/01/2024, 01/15/2024
- Inconsistent price formats: 1250.50€, €1250.50, 1250.50 EUR, $1375.55, 1250.50, 1250.50 euros
- Inconsistent gender values: M, F, Male, Female, empty strings
- Mixed engine-power units: 150 HP, 150hp, 150 CV, 111 kW, missing values
- to_date() and date parsing functions
- regexp_replace() for price cleaning
- when().otherwise() conditional logic
- cast() for data type conversions
- fillna() and dropna() strategies
Realistic insurance business rules implemented:
- Age-based premium adjustments
- Geographic risk zone pricing
- Product-specific claim patterns
- Seasonal claim distributions
- Client lifecycle status transitions
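A hedged Spark SQL sketch of the kind of cleanup these issues call for; the table name (contrats_assurance) and column names (date_souscription, prime, genre) are assumptions, and the same logic maps directly onto the DataFrame functions listed above:

```sql
-- Parse several date formats, strip currency symbols before casting, and
-- standardize gender codes. Depending on your Spark version and ANSI settings
-- you may prefer try_to_date() so unparseable values become NULL instead of errors.
SELECT
    COALESCE(TO_DATE(date_souscription, 'yyyy-MM-dd'),
             TO_DATE(date_souscription, 'dd/MM/yyyy'),
             TO_DATE(date_souscription, 'MM/dd/yyyy'))    AS clean_date,
    CAST(REGEXP_REPLACE(prime, '[^0-9.]', '') AS DOUBLE)  AS clean_premium,
    CASE
        WHEN UPPER(genre) IN ('M', 'MALE')   THEN 'M'
        WHEN UPPER(genre) IN ('F', 'FEMALE') THEN 'F'
        ELSE NULL
    END                                                   AS clean_gender
FROM contrats_assurance;
```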
Difficulty: Intermediate - suitable for learners with basic Python/SQL knowledge ready to tackle real-world data challenges.
Generated with realistic French business context and intentional quality issues for educational purposes. All data is synthetic and does not represent real individuals or companies.