License: MIT (https://opensource.org/licenses/MIT)
This project focuses on analyzing the S&P 500 companies using data analysis tools like Python (Pandas), SQL, and Power BI. The goal is to extract insights related to sectors, industries, locations, and more, and visualize them using dashboards.
Included Files:
sp500_cleaned.csv – Cleaned dataset used for analysis
sp500_analysis.ipynb – Jupyter Notebook (Python + SQL code)
dashboard_screenshot.png – Screenshot of Power BI dashboard
README.md – Summary of the project and key takeaways
This project demonstrates practical data cleaning, querying, and visualization skills.
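As a flavor of the sector-level queries the notebook runs, here is a minimal hedged example, assuming the cleaned CSV has been loaded into a table named sp500 with Symbol and Sector columns (table and column names are assumptions):

```sql
-- Count S&P 500 constituents per sector (table and column names are assumed).
SELECT Sector,
       COUNT(Symbol) AS company_count
FROM sp500
GROUP BY Sector
ORDER BY company_count DESC;
```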
RSVP Movies is an Indian film production company that has produced many super-hit movies. They have usually released movies for the Indian audience, but for their next project they are planning to release a movie for a global audience in 2022.
The production company wants to plan every move analytically, based on data. We have taken the last three years of IMDb movie data and carried out the analysis using SQL, drawing meaningful insights that could help them start their new project.
For convenience, the entire analytics process has been divided into four segments, where each segment leads to significant insights from different combinations of tables. The questions in each segment, along with their business objectives, are written in the script below, with the solution code under every question.
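As an illustration of the kind of query the segments contain, here is a hedged sketch; the table and column names (movie, genre, ratings, date_published, avg_rating) are illustrative assumptions, not necessarily the exact schema used:

```sql
-- Average IMDb rating and movie count per genre for recently published titles.
SELECT g.genre,
       ROUND(AVG(r.avg_rating), 2) AS mean_rating,
       COUNT(*)                    AS movie_count
FROM movie   AS m
JOIN genre   AS g ON g.movie_id = m.id
JOIN ratings AS r ON r.movie_id = m.id
WHERE m.date_published >= '2019-01-01'
GROUP BY g.genre
ORDER BY mean_rating DESC;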
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This is a beginner-friendly SQLite database designed to help users practice SQL and relational database concepts. The dataset represents a basic business model inspired by NVIDIA and includes interconnected tables covering essential aspects like products, customers, sales, suppliers, employees, and projects. It's perfect for anyone new to SQL or data analytics who wants to learn and experiment with structured data.
Products: Includes details of 15 products (e.g., GPUs, AI accelerators). Attributes: product_id, product_name, category, release_date, price.
Customers: Lists 20 fictional customers with their industry and contact information. Attributes: customer_id, customer_name, industry, contact_email, contact_phone.
Sales: Contains 100 sales records tied to products and customers. Attributes: sale_id, product_id, customer_id, sale_date, region, quantity_sold, revenue.
Suppliers: Features 50 suppliers and the materials they provide. Attributes: supplier_id, supplier_name, material_supplied, contact_email.
Supply Chain: Tracks materials supplied to produce products, proportional to sales. Attributes: supply_chain_id, supplier_id, product_id, supply_date, quantity_supplied.
Departments: Lists 5 departments within the business. Attributes: department_id, department_name, location.
Employees: Contains data on 30 employees and their roles in different departments. Attributes: employee_id, first_name, last_name, department_id, hire_date, salary.
Projects: Describes 10 projects handled by different departments. Attributes: project_id, project_name, department_id, start_date, end_date, budget.
Number of Tables: 8. Total Rows: around 230 across all tables, ensuring quick queries and easy exploration.
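A simple practice query against this schema might look like the following sketch; the table names (products, sales) are assumed from the descriptions above, and the column names match the listed attributes:

```sql
-- Total units sold and revenue per product category.
SELECT p.category,
       SUM(s.quantity_sold) AS units_sold,
       SUM(s.revenue)       AS total_revenue
FROM sales    AS s
JOIN products AS p ON p.product_id = s.product_id
GROUP BY p.category
ORDER BY total_revenue DESC;
```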
SQL Case Study Project: Employee Database Analysis 📊
I recently completed a comprehensive SQL project involving a simulated employee database with multiple tables.
In this project, I practiced and applied a wide range of SQL concepts:
✅ Simple Queries ✅ Filtering with WHERE conditions ✅ Sorting with ORDER BY ✅ Aggregation using GROUP BY and HAVING ✅ Multi-table JOINs ✅ Conditional Logic using CASE ✅ Subqueries and Set Operators
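To give a flavor of these concepts, here is a hedged sketch (not taken from the repository) that combines a JOIN, CASE logic, GROUP BY, and HAVING against hypothetical employees and departments tables:

```sql
-- Classify employees into salary bands per department (illustrative schema).
SELECT d.department_name,
       CASE WHEN e.salary >= 100000 THEN 'Senior band'
            WHEN e.salary >= 60000  THEN 'Mid band'
            ELSE 'Entry band' END AS salary_band,
       COUNT(*) AS employee_count
FROM employees   AS e
JOIN departments AS d ON d.department_id = e.department_id
GROUP BY d.department_name,
         CASE WHEN e.salary >= 100000 THEN 'Senior band'
              WHEN e.salary >= 60000  THEN 'Mid band'
              ELSE 'Entry band' END
HAVING COUNT(*) > 1
ORDER BY d.department_name, employee_count DESC;
```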
💡 Key Highlights:
🛠️ Tools Used: Azure Data Studio
📂 You can find the entire project and scripts here:
👉 https://github.com/RiddhiNDivecha/Employee-Database-Analysis
This project helped me sharpen my SQL skills and understand business logic more deeply in a practical context.
💬 I’m open to feedback and happy to connect with fellow data enthusiasts!
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
The dataset used in this project is inspired by the HR Analytics: Job Change of Data Scientists dataset available on Kaggle. It contains information about candidates’ demographics, education, work experience, company details, and training hours, aiming to predict whether a candidate is likely to seek a new job. This simulation recreates the structure of the original dataset in a lightweight SQLite environment to demonstrate SQL operations in Python. It provides an ideal context for learning and practicing essential SQL commands such as CREATE, INSERT, SELECT, JOIN, and more, using realistic HR data scenarios.
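A short SQLite-flavored sketch of the commands practiced here is shown below; the candidates table and its columns are simplified assumptions, not the exact schema of the Kaggle dataset:

```sql
-- CREATE, INSERT, and SELECT on a simplified HR table (illustrative schema).
CREATE TABLE IF NOT EXISTS candidates (
    candidate_id     INTEGER PRIMARY KEY,
    city             TEXT,
    education_level  TEXT,
    experience_years INTEGER,
    training_hours   INTEGER,
    seeking_new_job  INTEGER  -- 1 = likely to change jobs, 0 = not
);

INSERT INTO candidates (candidate_id, city, education_level, experience_years, training_hours, seeking_new_job)
VALUES (1, 'city_103', 'Graduate', 5, 36, 1);

SELECT education_level,
       AVG(training_hours)  AS avg_training_hours,
       SUM(seeking_new_job) AS likely_switchers
FROM candidates
GROUP BY education_level;
```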
Title: Practical Exploration of SQL Constraints: Building a Foundation in Data Integrity
Introduction: Welcome to my data analysis project, where I focus on mastering SQL constraints, a pivotal aspect of database management. This project centers on hands-on experience with SQL's Data Definition Language (DDL) commands, emphasizing constraints such as PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, and DEFAULT. It demonstrates my foundational understanding of enforcing data integrity and maintaining a structured database environment.
Purpose: The primary purpose of this project is to showcase my proficiency in implementing and managing SQL constraints for robust data governance. The exercises give insight into my SQL skills and how I use constraints to ensure data accuracy, consistency, and reliability within relational databases.
What to Expect: The project contains a series of exercises on the implementation and use of the following key constraint types:
NOT NULL: Ensuring the presence of essential data in a column.
PRIMARY KEY: Ensuring unique identification of records for data integrity.
FOREIGN KEY: Establishing relationships between tables to maintain referential integrity.
UNIQUE: Guaranteeing the uniqueness of values within specified columns.
CHECK: Implementing custom conditions to validate data entries.
DEFAULT: Setting default values for columns to enhance data reliability.
Each exercise is accompanied by clear and concise SQL scripts, explanations of the intended outcomes, and practical insights into the application of these constraints. Together, the exercises show how SQL constraints serve as crucial tools for creating a structured and dependable database foundation, upholding data quality, and supporting informed decision-making in data analysis.
3.1 CONSTRAINT - ENFORCING NOT NULL CONSTRAINT WHILE CREATING A NEW TABLE.
3.2 CONSTRAINT - ENFORCING NOT NULL CONSTRAINT ON AN EXISTING COLUMN.
3.3 CONSTRAINT - ENFORCING PRIMARY KEY CONSTRAINT WHILE CREATING A NEW TABLE.
3.4 CONSTRAINT - ENFORCING PRIMARY KEY CONSTRAINT ON AN EXISTING COLUMN.
3.5 CONSTRAINT - ENFORCING FOREIGN KEY CONSTRAINT WHILE CREATING A NEW TABLE.
3.6 CONSTRAINT - ENFORCING FOREIGN KEY CONSTRAINT ON AN EXISTING COLUMN.
3.7 CONSTRAINT - ENFORCING UNIQUE CONSTRAINT WHILE CREATING A NEW TABLE.
3.8 CONSTRAINT - ENFORCING UNIQUE CONSTRAINT IN AN EXISTING TABLE.
3.9 CONSTRAINT - ENFORCING CHECK CONSTRAINT IN A NEW TABLE.
3.10 CONSTRAINT - ENFORCING CHECK CONSTRAINT IN AN EXISTING TABLE.
3.11 CONSTRAINT - ENFORCING DEFAULT CONSTRAINT IN A NEW TABLE.
3.12 CONSTRAINT - ENFORCING DEFAULT CONSTRAINT IN AN EXISTING TABLE.
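To make the outline above concrete, here is a short hedged sketch of the same constraint types applied to an invented two-table schema (not the tables used in the project scripts); exact ALTER TABLE syntax varies slightly between database engines:

```sql
-- Constraint types 3.1, 3.3, 3.5, 3.7, 3.9 and 3.11 in a single CREATE TABLE.
CREATE TABLE departments (
    dept_id   INT PRIMARY KEY,
    dept_name VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE employees (
    emp_id   INT PRIMARY KEY,                      -- 3.3 PRIMARY KEY
    emp_name VARCHAR(100) NOT NULL,                -- 3.1 NOT NULL
    email    VARCHAR(100) UNIQUE,                  -- 3.7 UNIQUE
    salary   DECIMAL(10,2) CHECK (salary > 0),     -- 3.9 CHECK
    status   VARCHAR(20) DEFAULT 'ACTIVE',         -- 3.11 DEFAULT
    dept_id  INT,
    CONSTRAINT fk_emp_dept
        FOREIGN KEY (dept_id) REFERENCES departments (dept_id)  -- 3.5 FOREIGN KEY
);

-- The "existing table" variants (3.2, 3.4, 3.6, 3.8, 3.10, 3.12) follow the same
-- ALTER TABLE pattern.
ALTER TABLE employees
    ADD CONSTRAINT chk_salary_cap CHECK (salary <= 1000000);
```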
🎟️ BookMyShow SQL Data Analysis
🎯 Objective
This project leverages SQL-based analysis to gain actionable insights into user engagement, movie performance, theater efficiency, payment systems, and customer satisfaction on the BookMyShow platform. The goal is to enhance platform performance, boost revenue, and optimize user experience through data-driven strategies.
📊 Key Analysis Areas
1. 👥 User Behavior & Engagement
- Identify most active users and repeat customers
- Track unique monthly users
- Analyze peak booking times and average tickets per user
- Drive engagement strategies and boost customer retention
2. 🎬 Movie Performance Analysis
- Highlight top-rated and most booked movies
- Analyze popular languages and high-revenue genres
- Study average occupancy rates
- Focus marketing on high-performing genres and content
3. 🏢 Theater & Show Performance
- Pinpoint theaters with highest/lowest bookings
- Evaluate popular show timings
- Measure theater-wise revenue contribution and occupancy
- Improve theater scheduling and resource allocation
4. 💵 Booking & Revenue Insights
- Track total revenue, top spenders, and monthly booking patterns
- Discover most used payment methods
- Calculate average price per booking and bookings per user
- Optimize revenue generation and spending strategies
5. 🪑 Seat Utilization & Pricing Strategy
- Identify most booked seat types and their revenue impact
- Analyze seat pricing variations and price elasticity
- Align pricing strategy with demand patterns for higher revenue
6. ✅❌ Payment & Transaction Analysis
- Distinguish successful vs. failed transactions
- Track refund frequency and payment delays
- Evaluate revenue lost due to failures
- Enhance payment processing systems
7. ⭐ User Reviews & Sentiment Analysis
- Measure average ratings per movie
- Identify top and lowest-rated content
- Analyze review volume and sentiment trends
- Leverage feedback to refine content offerings
🧰 Tech Stack
- Query Language: SQL (MySQL/PostgreSQL)
- Database Tools: DBeaver, pgAdmin, or any SQL IDE
- Visualization (Optional): Power BI / Tableau for presenting insights
- Version Control: Git & GitHub
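A hedged sketch for analysis area 2 (most booked and highest-grossing movies) is shown below; the table and column names (bookings, movies, ticket_count, total_amount) are assumptions for illustration, and the actual schema may differ:

```sql
-- Top 10 movies by gross revenue, with booking and ticket counts.
SELECT m.movie_title,
       COUNT(b.booking_id) AS total_bookings,
       SUM(b.ticket_count) AS tickets_sold,
       SUM(b.total_amount) AS gross_revenue
FROM bookings AS b
JOIN movies   AS m ON m.movie_id = b.movie_id
GROUP BY m.movie_title
ORDER BY gross_revenue DESC
LIMIT 10;
```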
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
📖 Dataset Description
This dataset provides an end-to-end view of vendor performance across multiple dimensions — purchases, sales, inventory, pricing, and invoices. It is designed for data analytics, visualization, and business intelligence projects, making it ideal for learners and professionals exploring procurement, vendor management, and supply chain optimization.
🔗 GitHub Project (Code + Power BI Dashboard): Vendor Performance Analysis (https://github.com/HARSH-MADHAVAN/Vendor-Performance-Analysis)
The dataset includes:
purchases.csv → Detailed vendor purchase transactions
sales.csv → Sales performance data linked to vendors
inventory.csv (begin & end) → Stock levels at different periods
purchase_prices.csv → Historical vendor pricing
vendor_invoice.csv → Invoice details for reconciliation
vendor_sales_summary.csv → Aggregated vendor-wise sales insights
Use this dataset to practice:
SQL querying & data modeling
Python analytics & preprocessing
Power BI dashboarding & reporting
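For example, a vendor-level purchase vs. sales comparison might look like the hedged sketch below; it assumes the CSVs are loaded as tables named purchases and sales with vendor_id, purchase_amount, and sales_amount columns (all column names are assumptions):

```sql
-- Compare total purchase spend and sales revenue per vendor using CTEs,
-- aggregating each side first to avoid double counting on the join.
WITH purchase_totals AS (
    SELECT vendor_id, SUM(purchase_amount) AS total_purchases
    FROM purchases
    GROUP BY vendor_id
),
sales_totals AS (
    SELECT vendor_id, SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY vendor_id
)
SELECT p.vendor_id,
       p.total_purchases,
       s.total_sales,
       s.total_sales - p.total_purchases AS gross_margin
FROM purchase_totals AS p
JOIN sales_totals    AS s ON s.vendor_id = p.vendor_id
ORDER BY gross_margin DESC;
```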
License: MIT (https://opensource.org/licenses/MIT)
Dataset: cloud-training-demos.fintech
This dataset, hosted on BigQuery, is designed for financial technology (fintech) training and analysis. It comprises six interconnected tables, each providing detailed insights into various aspects of customer loans, loan purposes, and regional distributions. The dataset is ideal for practicing SQL queries, building data models, and conducting financial analytics.
customer:
Contains records of individual customers, including demographic details and unique customer IDs. This table serves as a primary reference for analyzing customer behavior and loan distribution.
loan:
Includes detailed information about each loan issued, such as the loan amount, interest rate, and tenure. The table is crucial for analyzing lending patterns and financial outcomes.
loan_count_by_year:
Provides aggregated loan data by year, offering insights into yearly lending trends. This table helps in understanding the temporal dynamics of loan issuance.
loan_purposes:
Lists various reasons or purposes for which loans were issued, along with corresponding loan counts. This data can be used to analyze customer needs and market demands.
loan_with_region:
Combines loan data with regional information, allowing for geographical analysis of lending activities. This table is key for regional market analysis and understanding how loan distribution varies across different areas.
state_region:
Maps state names to their respective regions, enabling a more granular geographical analysis when combined with other tables in the dataset.
For example, you can query the loan_count_by_year table to observe how lending patterns evolve over time. This dataset is ideal for those looking to enhance their skills in SQL, financial data analysis, and BigQuery, providing a comprehensive foundation for fintech-related projects and case studies.
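A regional rollup might look like the following BigQuery sketch; the loan_with_region table name comes from the description above, while the region and loan_amount column names are assumptions:

```sql
-- Loan volume and average loan size by region.
SELECT region,
       COUNT(*)                   AS loan_count,
       ROUND(AVG(loan_amount), 2) AS avg_loan_amount
FROM `cloud-training-demos.fintech.loan_with_region`
GROUP BY region
ORDER BY loan_count DESC;
```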
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.
I am not affiliated or partnered with Kensho in any way; I just really like this dataset because it is easy for my agents to query.
Key Features:
Contains over 5 million rows of data from English Wikipedia and Wikidata
Stored in a portable SQLite database format for easy integration and querying
Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
Ideal for NLP tasks, machine learning, data analysis, and research projects
The database consists of four main tables:
This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license.
Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes.
By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.
https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data
Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings
Usage with LIKE queries:
```
import asyncio
import aiosqlite


class KenshoDatasetQuery:
    """Async context manager for running LIKE searches against the SQLite DB."""

    def __init__(self, db_file):
        self.db_file = db_file

    async def __aenter__(self):
        self.conn = await aiosqlite.connect(self.db_file)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.conn.close()

    async def search_pages_by_title(self, title):
        # Join pages to their Wikidata items and link-annotated text sections.
        query = """
            SELECT pages.page_id, pages.item_id, pages.title, pages.views,
                   items.labels AS item_labels, items.description AS item_description,
                   link_annotated_text.sections
            FROM pages
            JOIN items ON pages.item_id = items.id
            JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
            WHERE pages.title LIKE ?
        """
        async with self.conn.execute(query, (f"%{title}%",)) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label_or_description(self, keyword):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label(self, label):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ?
        """
        async with self.conn.execute(query, (f"%{label}%",)) as cursor:
            return await cursor.fetchall()

    # async def search_properties_by_label_or_desc... (truncated in the original listing)
```
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This project was a powerful introduction to the practical application of database design and SQL in a real-world scenario. It helped me understand how a well-structured relational database supports business scalability and data integrity — especially for businesses transitioning from flat files like spreadsheets to a more robust system.
One key takeaway for me was the importance of normalizing data, not just to reduce redundancy but to ensure that information is easily queryable and future-proof. Working with MySQL Workbench also gave me hands-on experience in visual database modeling, which made the conceptual relationships between tables much clearer.
While I encountered a few challenges setting up MySQL Workbench and configuring the database connections, overcoming those technical steps gave me more confidence in managing development tools — a crucial skill for both data analysts and back-end developers.
If I were to extend this project in the future, I would consider:
Adding tables for inventory management, supplier information, or delivery tracking
Building simple data dashboards to visualize sales and product performance
Automating the data import process from CSV to SQL
Overall, this project bridged the gap between theory and practical application. It deepened my understanding of how structured data can unlock powerful insights and better decision-making for businesses.
The Practical Exercise in SQL Data Definition Language (DDL) Commands is a hands-on project designed to help you gain a deep understanding of fundamental DDL commands in SQL, including the operations listed below.
This project aims to enhance your proficiency in using SQL to create, modify, and manage database structures effectively.
1.1 DDL - CREATE TABLE
1.2 DDL - ALTER TABLE (ADD COLUMN)
1.3 DDL - ALTER TABLE (RENAME COLUMN)
1.4 DDL - ALTER TABLE (RENAME TABLE)
1.5 DDL - ALTER TABLE (DROP COLUMN)
1.6 DDL - DROP TABLE
1.7 DDL - TRUNCATE TABLE
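A hedged walk-through of these commands on a made-up table is sketched below; exact ALTER TABLE syntax varies slightly by engine (for example, some engines omit the COLUMN keyword):

```sql
CREATE TABLE staff (
    staff_id  INT,
    full_name VARCHAR(100)
);                                                        -- 1.1 create the table

ALTER TABLE staff ADD COLUMN hire_date DATE;              -- 1.2 add a column
ALTER TABLE staff RENAME COLUMN full_name TO staff_name;  -- 1.3 rename a column
ALTER TABLE staff RENAME TO employees;                    -- 1.4 rename the table
ALTER TABLE employees DROP COLUMN hire_date;              -- 1.5 drop a column
TRUNCATE TABLE employees;                                 -- 1.7 remove all rows, keep structure
DROP TABLE employees;                                     -- 1.6 drop the table entirely
```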
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Comprehensive Amazon India sales dataset featuring 15,000 synthetic e-commerce transactions from 2025. This cleaned and validated dataset captures real-world shopping patterns including customer behavior, product preferences, payment methods, delivery metrics, and regional sales distribution across Indian states.
Key Features:
- 15,000 orders across multiple product categories (Electronics, Clothing, Home & Kitchen, Beauty)
- Daily transactional data from January to December 2025
- Complete customer journey: order placement, payment, delivery, and review
- Geographic coverage across major Indian states
- Payment method diversity: Credit Card, Debit Card, UPI, Cash on Delivery
- Delivery status tracking: Delivered, Pending, Returned
- Customer review ratings and sentiment analysis
Dataset Columns (14): Order_ID, Date, Customer_ID, Product_Category, Product_Name, Quantity, Unit_Price_INR, Total_Sales_INR, Payment_Method, Delivery_Status, Review_Rating, Review_Text, State, Country
Use Cases:
- E-commerce sales analysis and forecasting
- Customer behavior and segmentation studies
- Payment method preference analysis
- Regional market trends and geographic insights
- Delivery optimization and logistics planning
- Product performance and category analysis
- Customer satisfaction and review analysis
- SQL practice and business intelligence training
Data Quality:
- Cleaned and validated for analysis
- No missing values in critical fields
- Consistent data types and formatting
- Ready for immediate SQL/Python analysis
Perfect for data analysts, SQL learners, business intelligence projects, and e-commerce analytics practice!
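A typical BI-style query on this data might look like the following sketch; the table name amazon_sales is an assumption, the column names come from the dataset description above, and strftime() is SQLite syntax (use your engine's equivalent, such as DATE_FORMAT in MySQL):

```sql
-- Monthly revenue and order count per product category.
SELECT strftime('%Y-%m', Date)   AS order_month,
       Product_Category,
       SUM(Total_Sales_INR)      AS revenue_inr,
       COUNT(DISTINCT Order_ID)  AS orders
FROM amazon_sales
GROUP BY order_month, Product_Category
ORDER BY order_month, revenue_inr DESC;
```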
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
This dataset is a synthetic yet realistic E-commerce retail dataset generated programmatically using Python (Faker + NumPy + Pandas).
It is designed to closely mimic real-world online shopping behavior, user patterns, product interactions, seasonal trends, and marketplace events.
Machine Learning & Deep Learning
Recommender Systems
Customer Segmentation
Sales Forecasting
A/B Testing
E-commerce Behaviour Analysis
Data Cleaning / Feature Engineering Practice
SQL practice
The dataset contains 6 CSV files:
~~~
File              Rows      Description
users.csv         ~10,000   User profiles, demographics & signup info
products.csv      ~2,000    Product catalog with rating and pricing
orders.csv        ~20,000   Order-level transactions
order_items.csv   ~60,000   Items purchased per order
reviews.csv       ~15,000   Customer-written product reviews
events.csv        ~80,000   User event logs: view, cart, wishlist, purchase
~~~
1. Users (users.csv)
- user_id: Unique user identifier
- name: Full customer name
- email: Email (synthetic, no real emails)
- gender: Male / Female / Other
- city: City of residence
- signup_date: Account creation date
2. Products (products.csv)
- product_id: Unique product identifier
- product_name: Product title
- category: Electronics, Clothing, Beauty, Home, Sports, etc.
- price: Actual selling price
- rating: Average product rating
3. Orders (orders.csv)
- order_id: Unique order identifier
- user_id: User who placed the order
- order_date: Timestamp of the order
- order_status: Completed / Cancelled / Returned
- total_amount: Total order value
4. Order Items (order_items.csv)
- order_item_id: Unique identifier
- order_id: Associated order
- product_id: Purchased product
- quantity: Quantity purchased
- item_price: Price per unit
5. Reviews (reviews.csv)
- review_id: Unique review identifier
- user_id: User who submitted the review
- product_id: Reviewed product
- rating: 1–5 star rating
- review_text: Short synthetic review
- review_date: Submission date
6. Events (events.csv)
- event_id: Unique event identifier
- user_id: User performing the event
- product_id: Viewed/added/purchased product
- event_type: view / cart / wishlist / purchase
- event_timestamp: Timestamp of the event
Example analytics and modeling use cases:
Customer churn prediction
Review sentiment analysis (NLP)
Recommendation engines
Price optimization models
Demand forecasting (Time-series)
Market basket analysis
RFM segmentation
Cohort analysis
Funnel conversion tracking
A/B testing simulations
SQL practice topics (see the sketch after this list):
Joins
Window functions
Aggregations
CTE-based funnels
Complex queries
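As a small example of a CTE-based funnel on the events table described above (view to cart to purchase), counting distinct users who reach each stage:

```sql
-- Funnel stages with the share of viewers reaching each stage,
-- combining a CTE, aggregation, and a window function.
WITH stage_users AS (
    SELECT event_type,
           COUNT(DISTINCT user_id) AS users_in_stage
    FROM events
    WHERE event_type IN ('view', 'cart', 'purchase')
    GROUP BY event_type
)
SELECT event_type,
       users_in_stage,
       ROUND(100.0 * users_in_stage /
             MAX(CASE WHEN event_type = 'view' THEN users_in_stage END) OVER (), 2)
           AS pct_of_viewers
FROM stage_users
ORDER BY users_in_stage DESC;
```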
Generated using:
Faker for realistic user and review generation
NumPy for probability-based event modeling
Pandas for data processing
Simulated data characteristics include:
demand variation
user behavior simulation
return/cancel probabilities
seasonal order timestamp distribution
The dataset does not include any real personal data.
Everything is generated synthetically.
This dataset is released under CC BY 4.0 — free to use for:
Research
Education
Commercial projects
Kaggle competitions
Machine learning pipelines
Just provide attribution.
Upvote the dataset
Leave a comment
Share your notebooks using it
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains comprehensive synthetic healthcare data designed for fraud detection analysis. It includes information on patients, healthcare providers, insurance claims, and payments. The dataset is structured to mimic real-world healthcare transactions, where fraudulent activities such as false claims, overbilling, and duplicate charges can be identified through advanced analytics.
The dataset is suitable for practicing SQL queries, exploratory data analysis (EDA), machine learning for fraud detection, and visualization techniques. It is designed to help data analysts and data scientists develop and refine their analytical skills in the healthcare insurance domain.
Dataset Overview
The dataset consists of four CSV files:
Patients Data (patients.csv)
Contains demographic details of patients, such as age, gender, insurance type, and location. Can be used to analyze patient demographics and healthcare usage patterns.
Providers Data (providers.csv)
Contains information about healthcare providers, including provider ID, specialty, location, and associated hospital. Useful for identifying fraudulent claims linked to specific providers or hospitals.
Claims Data (claims.csv)
Contains records of insurance claims made by patients, including diagnosis codes, treatment details, provider ID, and claim amount. Can be analyzed for suspicious patterns, such as excessive claims from a single provider or duplicate claims for the same patient.
Payments Data (payments.csv)
Contains details of claim payments made by insurance companies, including payment amount, claim ID, and reimbursement status. Helps in detecting discrepancies between claims and actual reimbursements.
Possible Analysis Ideas
This dataset allows for multiple analysis approaches, including but not limited to:
🔹 Fraud Detection: Identify patterns in claims data to detect fraudulent activities (e.g., excessive billing, duplicate claims).
🔹 Provider Behavior Analysis: Analyze providers who have an unusually high claim volume or high rejection rates.
🔹 Payment Trends: Compare claims vs. payments to find irregularities in reimbursement patterns.
🔹 Patient Demographics & Utilization: Explore which patient groups are more likely to file claims and receive reimbursements.
🔹 SQL Query Practice: Perform advanced SQL queries, including joins, aggregations, window functions, and subqueries, to extract insights from the data.
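A hedged fraud-detection sketch on the claims file is shown below; column names such as patient_id, provider_id, diagnosis_code, claim_amount, and claim_date are assumptions based on the descriptions above:

```sql
-- Flag potential duplicate claims: same patient, provider, diagnosis,
-- amount, and date appearing more than once.
SELECT patient_id,
       provider_id,
       diagnosis_code,
       claim_amount,
       claim_date,
       COUNT(*) AS duplicate_count
FROM claims
GROUP BY patient_id, provider_id, diagnosis_code, claim_amount, claim_date
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;
```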
Use Cases
Practicing SQL queries for job interviews and real-world projects.
Learning data cleaning, data wrangling, and feature engineering for healthcare analytics.
Applying machine learning techniques for fraud detection.
Gaining insights into the healthcare insurance domain and its challenges.
License & Usage
License: CC0 Public Domain (free to use for any purpose).
Attribution: Not required but appreciated.
Intended Use: This dataset is for educational and research purposes only.
This dataset is an excellent resource for aspiring data analysts, data scientists, and SQL learners who want to gain hands-on experience in healthcare fraud detection.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
This portfolio highlights both practice projects and contributions I've made on the job, with a focus on practical, results-driven analysis. Each project reflects my ability to solve business problems using tools like Excel for data visualization, SQL for querying and structuring data, and the skills I've built in Python.
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
Complete data engineering project on 4 years (2014-2017) of retail sales transactions.
DATASET CONTENTS:
- Original denormalized data (9,994 rows)
- Normalized database: 4 tables (customers, orders, products, sales)
- 9 SQL analysis files organized by phase
- Complete EDA from data cleaning to business insights
DATABASE TABLES:
- customers: 793 records
- orders: 4,931 records
- products: 1,812 records
- sales: 9,686 transactions
KEY FINDINGS:
- Low profitability: 12.44% margin (below industry standard)
- Discount problem: 50%+ of transactions have 20%+ discounts
- Loss-making: 18.66% of transactions lose money
- Furniture crisis: only 2.31% margin
- Small baskets: only 1.96 items per order
SQL SKILLS DEMONSTRATED:
✓ Window functions (ROW_NUMBER, PARTITION BY)
✓ Database normalization (3NF)
✓ Complex JOINs (3-4 tables)
✓ Data deduplication with CTEs
✓ Business analytics queries
✓ CASE statements and aggregations
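A hedged sketch of the deduplication pattern listed above, keeping one row per order line with ROW_NUMBER() in a CTE; the column names are assumptions based on a typical Superstore-style schema, not the project's exact files:

```sql
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id, product_id
               ORDER BY sale_amount DESC
           ) AS rn
    FROM sales
)
SELECT *
FROM ranked
WHERE rn = 1;  -- keep a single row per (order, product) pair
```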
PERFECT FOR:
- SQL practice (beginner to advanced)
- Database normalization learning
- EDA methodology study
- Business analytics projects
- Data engineering portfolios
FILES INCLUDED:
- 5 CSV files (original + 4 normalized tables)
- 9 SQL query files (cleaning, migration, analysis)
Author: Nawaf Alzzeer
License: CC BY-SA 4.0
License: MIT (https://opensource.org/licenses/MIT)
This dataset provides a comprehensive view of retail operations, combining sales transactions, return records, and shipping cost details into one analysis-ready package. It’s ideal for data analysts, business intelligence professionals, and students looking to practice Power BI, Tableau, or SQL projects focusing on sales performance, profitability, and operational cost analysis.
Dataset Structure
Orders Table – Detailed transactional data
Row ID
Order ID
Order Date, Ship Date, Delivery Duration
Ship Mode
Customer ID, Customer Name, Segment, Country, City, State, Postal Code, Region
Product ID, Category, Sub-Category, Product Name
Sales, Quantity, Discount, Discount Value, Profit, COGS
Returns Table – Return records by Order ID
Returned (Yes/No)
Order ID
Shipping Cost Table – State-level shipping expenses
State
Shipping Cost Per Unit
Potential Use Cases
Calculate gross vs. net profit after considering returns and shipping costs (see the sketch after this list).
Perform regional sales and profit analysis.
Identify high-return products and loss-making categories.
Visualize KPIs in Power BI or Tableau.
Build predictive models for returns or shipping costs.
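For the first use case above, a hedged sketch might look like the following; the table names (orders, returns, shipping_cost) and the underscore column names are assumptions based on the structure described earlier:

```sql
-- State-level gross vs. net profit after returned-order losses and
-- per-unit shipping costs.
SELECT o.State,
       SUM(o.Profit)                                                AS gross_profit,
       SUM(CASE WHEN r.Returned = 'Yes' THEN o.Profit ELSE 0 END)   AS profit_lost_to_returns,
       SUM(o.Quantity * COALESCE(sc.Shipping_Cost_Per_Unit, 0))     AS shipping_cost,
       SUM(o.Profit)
         - SUM(CASE WHEN r.Returned = 'Yes' THEN o.Profit ELSE 0 END)
         - SUM(o.Quantity * COALESCE(sc.Shipping_Cost_Per_Unit, 0)) AS net_profit
FROM orders AS o
LEFT JOIN returns       AS r  ON r.Order_ID = o.Order_ID
LEFT JOIN shipping_cost AS sc ON sc.State = o.State
GROUP BY o.State
ORDER BY net_profit;
```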
Source & Context: The dataset is designed for educational and analytical purposes. It is inspired by retail and e-commerce operations data and was prepared for data analytics portfolio projects.
License: Open for use in learning, analytics projects, and data visualization practice.
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Cyclistic Bike-Share Dataset (2022–2024) – Cleaned & Merged
This dataset contains three full years (2022, 2023, and 2024) of publicly available Cyclistic bike-share trip data. All yearly files have been cleaned, standardized, and merged into a single high-quality master dataset for easy analysis.
The dataset is ideal for the use cases listed under "What You Can Analyze" below.
🔹 Key Cleaning & Processing Steps
- Removed duplicate records
- Handled missing values
- Standardized column names
- Converted date-time formats
- Created calculated columns (ride length, day, month, etc.)
- Merged yearly datasets into one master CSV file (3.17 GB)
🔹 What You Can Analyze
- Member vs Casual rider behavior
- Peak riding hours and days
- Monthly & seasonal trends
- Trip duration patterns
- Station usage & demand forecasting
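A hedged sketch of the member vs. casual comparison is shown below; the table name (cyclistic_trips) and column names (member_casual, day_of_week, ride_length_minutes) are assumptions based on the calculated columns described above:

```sql
-- Ride count and average ride length by rider type and day of week.
SELECT member_casual,
       day_of_week,
       COUNT(*)                           AS rides,
       ROUND(AVG(ride_length_minutes), 1) AS avg_ride_length_min
FROM cyclistic_trips
GROUP BY member_casual, day_of_week
ORDER BY member_casual, rides DESC;
```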
This dataset is especially useful for data analyst portfolio projects and technical interview preparation.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
A realistic synthetic French insurance dataset specifically designed for practicing data cleaning, transformation, and analytics with PySpark and other big data tools. This dataset contains intentional data quality issues commonly found in real-world insurance data.
Perfect for practicing data cleaning and transformation:
Examples of the intentional quality issues and the PySpark functions they exercise:
- Mixed date formats: 2024-01-15, 15/01/2024, 01/15/2024
- Inconsistent price formats: 1250.50€, €1250.50, 1250.50 EUR, $1375.55, 1250.50, 1250.50 euros
- Inconsistent gender values: M, F, Male, Female, empty strings
- Mixed engine-power units: 150 HP, 150hp, 150 CV, 111 kW, missing values
- to_date() and date parsing functions
- regexp_replace() for price cleaning
- when().otherwise() conditional logic
- cast() for data type conversions
- fillna() and dropna() strategies
Realistic insurance business rules implemented:
- Age-based premium adjustments
- Geographic risk zone pricing
- Product-specific claim patterns
- Seasonal claim distributions
- Client lifecycle status transitions
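A hedged Spark SQL sketch of the kind of cleanup these issues call for; the table name (contrats_assurance) and column names (date_souscription, prime, genre) are assumptions, and the same logic maps directly onto the DataFrame functions listed above:

```sql
-- Parse several date formats, strip currency symbols before casting, and
-- standardize gender codes. Depending on your Spark version and ANSI settings
-- you may prefer try_to_date() so unparseable values become NULL instead of errors.
SELECT
    COALESCE(TO_DATE(date_souscription, 'yyyy-MM-dd'),
             TO_DATE(date_souscription, 'dd/MM/yyyy'),
             TO_DATE(date_souscription, 'MM/dd/yyyy'))    AS clean_date,
    CAST(REGEXP_REPLACE(prime, '[^0-9.]', '') AS DOUBLE)  AS clean_premium,
    CASE
        WHEN UPPER(genre) IN ('M', 'MALE')   THEN 'M'
        WHEN UPPER(genre) IN ('F', 'FEMALE') THEN 'F'
        ELSE NULL
    END                                                   AS clean_gender
FROM contrats_assurance;
```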
Difficulty: Intermediate - suitable for learners with basic Python/SQL knowledge ready to tackle real-world data challenges.
Generated with realistic French business context and intentional quality issues for educational purposes. All data is synthetic and does not represent real individuals or companies.