43 datasets found
  1. S&P 500 Companies Analysis Project

    • kaggle.com
    zip
    Updated Apr 6, 2025
    Cite
    anshadkaggle (2025). S&P 500 Companies Analysis Project [Dataset]. https://www.kaggle.com/datasets/anshadkaggle/s-and-p-500-companies-analysis-project
    Explore at:
    Available download formats: zip (9721576 bytes)
    Dataset updated
    Apr 6, 2025
    Authors
    anshadkaggle
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This project focuses on analyzing the S&P 500 companies using data analysis tools like Python (Pandas), SQL, and Power BI. The goal is to extract insights related to sectors, industries, locations, and more, and visualize them using dashboards.

    Included Files:

    sp500_cleaned.csv – Cleaned dataset used for analysis

    sp500_analysis.ipynb – Jupyter Notebook (Python + SQL code)

    dashboard_screenshot.png – Screenshot of Power BI dashboard

    README.md – Summary of the project and key takeaways

    This project demonstrates practical data cleaning, querying, and visualization skills.
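
    A minimal sketch of the kind of sector breakdown the project describes, using pandas. It assumes sp500_cleaned.csv is in the working directory and has a "Sector" column; the actual column name may differ in the file.

```python
# Minimal sketch: load the cleaned file and count companies per sector.
# "Sector" is an assumed column name; adjust it to match sp500_cleaned.csv.
import pandas as pd

df = pd.read_csv("sp500_cleaned.csv")
sector_counts = df["Sector"].value_counts()  # companies per sector
print(sector_counts.head(10))
```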

  2. IMDB Movies Analysis - SQL

    • kaggle.com
    zip
    Updated Feb 21, 2023
    Cite
    Gaurav B R (2023). IMDB Movies Analysis - SQL [Dataset]. https://www.kaggle.com/datasets/gauravbr/imdb-movies-data-erd
    Explore at:
    Available download formats: zip (3818401 bytes)
    Dataset updated
    Feb 21, 2023
    Authors
    Gaurav B R
    Description

    SQL IMDB Movies Analysis for RSVP (Film Production Company)

    RSVP Movies is an Indian film production company that has produced many super-hit movies. It has usually released movies for the Indian audience, but for its next project it is planning to release a movie for a global audience in 2022.

    The production company wants to plan its every move analytically, based on data. We have taken the last three years of IMDB movie data and carried out the analysis using SQL, drawing meaningful insights that could help the company start its new project.

    For convenience, the entire analytics process has been divided into four segments, each of which leads to significant insights from a different combination of tables. The questions in each segment, along with their business objectives, are written in the script given below, and the solution code appears under every question.

  3. Nvidia Database

    • kaggle.com
    zip
    Updated Jan 30, 2025
    Cite
    Ajay Tom (2025). Nvidia Database [Dataset]. https://www.kaggle.com/datasets/ajayt0m/nvidia-database
    Explore at:
    Available download formats: zip (8712 bytes)
    Dataset updated
    Jan 30, 2025
    Authors
    Ajay Tom
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This is a beginner-friendly SQLite database designed to help users practice SQL and relational database concepts. The dataset represents a basic business model inspired by NVIDIA and includes interconnected tables covering essential aspects like products, customers, sales, suppliers, employees, and projects. It's perfect for anyone new to SQL or data analytics who wants to learn and experiment with structured data.

    Tables and Their Contents:

    Products:

    Includes details of 15 products (e.g., GPUs, AI accelerators). Attributes: product_id, product_name, category, release_date, price.

    Customers:

    Lists 20 fictional customers with their industry and contact information. Attributes: customer_id, customer_name, industry, contact_email, contact_phone.

    Sales:

    Contains 100 sales records tied to products and customers. Attributes: sale_id, product_id, customer_id, sale_date, region, quantity_sold, revenue.

    Suppliers:

    Features 50 suppliers and the materials they provide. Attributes: supplier_id, supplier_name, material_supplied, contact_email.

    Supply Chain:

    Tracks materials supplied to produce products, proportional to sales. Attributes: supply_chain_id, supplier_id, product_id, supply_date, quantity_supplied.

    Departments:

    Lists 5 departments within the business. Attributes: department_id, department_name, location.

    Employees:

    Contains data on 30 employees and their roles in different departments. Attributes: employee_id, first_name, last_name, department_id, hire_date, salary.

    Projects:

    Describes 10 projects handled by different departments. Attributes: project_id, project_name, department_id, start_date, end_date, budget.

    Why Use This Dataset?

    • Perfect for Beginners: The dataset is simple and easy to understand.
    • Interconnected Tables: Provides a basic introduction to relational database concepts like joins and foreign keys.
    • SQL Practice: Run basic queries, filter data, and perform simple aggregations or calculations.
    • Learning Tool: Great for small projects and understanding business datasets.

    Potential Use Cases:

    • Practice SQL queries (SELECT, INSERT, UPDATE, DELETE, JOIN).
    • Understand how to design and query relational databases.
    • Analyze basic sales and supply chain data for patterns and trends.
    • Learn how to use databases in analytics tools like Excel, Power BI, or Tableau.

    Data Size:

    Number of tables: 8. Total rows: around 230 across all tables, ensuring quick queries and easy exploration.
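
    A quick sketch of the kind of join and aggregation this schema supports, run from Python's sqlite3 module. The table and column names come from the description above; the database file name is a placeholder for the SQLite file shipped with the dataset.

```python
# Revenue and units sold per product category, joining sales to products.
import sqlite3

conn = sqlite3.connect("nvidia.db")  # placeholder file name
query = """
SELECT p.category,
       SUM(s.revenue)       AS total_revenue,
       SUM(s.quantity_sold) AS units_sold
FROM sales AS s
JOIN products AS p ON p.product_id = s.product_id
GROUP BY p.category
ORDER BY total_revenue DESC;
"""
for category, total_revenue, units_sold in conn.execute(query):
    print(category, total_revenue, units_sold)
conn.close()
```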

  4. Employee Database for SQL Case Study

    • kaggle.com
    zip
    Updated Jun 21, 2025
    Cite
    Riddhi N Divecha (2025). Employee Database for SQL Case Study [Dataset]. https://www.kaggle.com/datasets/riddhindivecha/employee-database-for-sql-case-study/code
    Explore at:
    Available download formats: zip (890 bytes)
    Dataset updated
    Jun 21, 2025
    Authors
    Riddhi N Divecha
    Description

    SQL Case Study Project: Employee Database Analysis 📊

    I recently completed a comprehensive SQL project involving a simulated employee database with multiple tables:

    • 🏢 DEPARTMENT
    • 👨‍💼 EMPLOYEE
    • 💼 JOB
    • 🌍 LOCATION

    In this project, I practiced and applied a wide range of SQL concepts:

    
    ✅ Simple Queries
    ✅ Filtering with WHERE conditions
    ✅ Sorting with ORDER BY
    ✅ Aggregation using GROUP BY and HAVING
    ✅ Multi-table JOINs
    ✅ Conditional Logic using CASE
    ✅ Subqueries and Set Operators

    💡 Key Highlights:

    • Salary grade classifications
    • Department-level insights
    • Employee trends based on hire dates
    • Advanced queries like Nth highest salary
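
    A self-contained sketch of the "Nth highest salary" pattern mentioned above, run against a tiny in-memory table so it does not depend on the project's actual EMPLOYEE schema. The DENSE_RANK approach shown here also works in SQL Server (the project's Azure Data Studio environment).

```python
# Find the Nth highest distinct salary using DENSE_RANK over a window.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, salary REAL)")
conn.executemany(
    "INSERT INTO employee (salary) VALUES (?)",
    [(90000,), (120000,), (120000,), (75000,), (105000,)],
)

n = 2  # 2nd highest distinct salary
query = """
SELECT salary FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employee
)
WHERE rnk = ?;
"""
print(conn.execute(query, (n,)).fetchone())  # (105000.0,)
conn.close()
```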

    🛠️ Tools Used: Azure Data Studio

    📂 You can find the entire project and scripts here:


    👉 https://github.com/RiddhiNDivecha/Employee-Database-Analysis

    This project helped me sharpen my SQL skills and understand business logic more deeply in a practical context.

    💬 I’m open to feedback and happy to connect with fellow data enthusiasts!

    #SQL #DataAnalytics #PortfolioProject #CaseStudy #LearningByDoing #DataScience #SQLProject

  5. HR Analytics SQL Exploration with Python & SQLite

    • kaggle.com
    zip
    Updated May 22, 2025
    Cite
    Enes Furkan ALBAYRAK (2025). HR Analytics SQL Exploration with Python & SQLite [Dataset]. https://www.kaggle.com/datasets/enesfurkanalbayrak/hr-analytics-sql-exploration-with-python-and-sqlite
    Explore at:
    Available download formats: zip (4767 bytes)
    Dataset updated
    May 22, 2025
    Authors
    Enes Furkan ALBAYRAK
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The dataset used in this project is inspired by the HR Analytics: Job Change of Data Scientists dataset available on Kaggle. It contains information about candidates’ demographics, education, work experience, company details, and training hours, aiming to predict whether a candidate is likely to seek a new job. This simulation recreates the structure of the original dataset in a lightweight SQLite environment to demonstrate SQL operations in Python. It provides an ideal context for learning and practicing essential SQL commands such as CREATE, INSERT, SELECT, JOIN, and more, using realistic HR data scenarios.
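
    A minimal illustration of the CREATE / INSERT / SELECT / JOIN workflow the project demonstrates with SQLite in Python. The table and column names below are made up for the example; the dataset's own notebook defines its actual schema.

```python
# Create two related tables, insert rows, and run a JOIN with aggregation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, size TEXT);
CREATE TABLE candidate (
    candidate_id    INTEGER PRIMARY KEY,
    company_id      INTEGER REFERENCES company(company_id),
    training_hours  INTEGER,
    looking_for_job INTEGER
);
INSERT INTO company VALUES (1, '50-99'), (2, '1000+');
INSERT INTO candidate VALUES (10, 1, 40, 1), (11, 2, 12, 0), (12, 2, 80, 1);
""")

rows = conn.execute("""
SELECT c.size,
       AVG(p.training_hours)  AS avg_hours,
       SUM(p.looking_for_job) AS job_seekers
FROM candidate AS p
JOIN company   AS c ON c.company_id = p.company_id
GROUP BY c.size;
""").fetchall()
print(rows)
conn.close()
```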

  6. SQL Integrity Journey: Unleashing Data Constraints

    • kaggle.com
    zip
    Updated Oct 9, 2023
    Cite
    Radha Gandhi (2023). SQL Integrity Journey: Unleashing Data Constraints [Dataset]. https://www.kaggle.com/datasets/radhagandhi/sql-integrity-journey-unleashing-data-constraints
    Explore at:
    Available download formats: zip (13817 bytes)
    Dataset updated
    Oct 9, 2023
    Authors
    Radha Gandhi
    Description

    Title: Practical Exploration of SQL Constraints: Building a Foundation in Data Integrity

    Introduction: Welcome to my data analysis project, which focuses on mastering SQL constraints, a pivotal aspect of database management. The project centers on hands-on experience with SQL's Data Definition Language (DDL) commands, emphasizing constraints such as PRIMARY KEY, FOREIGN KEY, UNIQUE, CHECK, and DEFAULT. It aims to demonstrate my foundational understanding of enforcing data integrity and maintaining a structured database environment.

    Purpose: The primary purpose of this project is to showcase my proficiency in implementing and managing SQL constraints for robust data governance. By delving into constraints, you'll gain insight into my SQL skills and how I use constraints to ensure data accuracy, consistency, and reliability within relational databases.

    What to Expect: The project contains a series of exercises focused on implementing and using the following key constraint types:

    • NOT NULL: ensures the presence of essential data in a column.
    • PRIMARY KEY: ensures unique identification of records for data integrity.
    • FOREIGN KEY: establishes relationships between tables to maintain referential integrity.
    • UNIQUE: guarantees the uniqueness of values within specified columns.
    • CHECK: implements custom conditions to validate data entries.
    • DEFAULT: sets default values for columns to enhance data reliability.

    Each exercise is accompanied by clear and concise SQL scripts, explanations of the intended outcomes, and practical insights into applying these constraints. My goal is to show how SQL constraints serve as crucial tools for creating a structured and dependable database foundation, and how they support data quality, accuracy, and informed decision-making in data analysis.

    Exercises:

    3.1 Constraint: enforcing a NOT NULL constraint while creating a new table.
    3.2 Constraint: enforcing a NOT NULL constraint on an existing column.
    3.3 Constraint: enforcing a PRIMARY KEY constraint while creating a new table.
    3.4 Constraint: enforcing a PRIMARY KEY constraint on an existing column.
    3.5 Constraint: enforcing a FOREIGN KEY constraint while creating a new table.
    3.6 Constraint: enforcing a FOREIGN KEY constraint on an existing column.
    3.7 Constraint: enforcing UNIQUE constraints while creating a new table.
    3.8 Constraint: enforcing a UNIQUE constraint on an existing table.
    3.9 Constraint: enforcing a CHECK constraint in a new table.
    3.10 Constraint: enforcing a CHECK constraint in an existing table.
    3.11 Constraint: enforcing a DEFAULT constraint in a new table.
    3.12 Constraint: enforcing a DEFAULT constraint in an existing table.
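
    A compact runnable sketch of the constraint types listed above, using SQLite from Python. The table names are illustrative rather than the ones used in the project, and SQLite only enforces foreign keys once PRAGMA foreign_keys is switched on.

```python
# One table pair demonstrating PRIMARY KEY, NOT NULL, UNIQUE, FOREIGN KEY,
# CHECK, and DEFAULT constraints.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,             -- PRIMARY KEY
    dept_name TEXT NOT NULL UNIQUE              -- NOT NULL + UNIQUE
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    dept_id INTEGER NOT NULL
            REFERENCES department(dept_id),     -- FOREIGN KEY
    salary  REAL CHECK (salary > 0),            -- CHECK
    status  TEXT DEFAULT 'active'               -- DEFAULT
);
""")

conn.execute("INSERT INTO department (dept_id, dept_name) VALUES (1, 'Analytics')")
conn.execute("INSERT INTO employee (emp_id, dept_id, salary) VALUES (100, 1, 55000)")
# Violations (e.g. a negative salary or an unknown dept_id) raise
# sqlite3.IntegrityError instead of silently storing bad data.
print(conn.execute("SELECT * FROM employee").fetchall())
conn.close()
```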

  7. BookMyShow-SQL-Data-Analysis

    • kaggle.com
    Updated May 6, 2025
    Cite
    Soumendu Ray (2025). BookMyShow-SQL-Data-Analysis [Dataset]. https://www.kaggle.com/datasets/soumenduray99/bookmyshow-sql-data-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Soumendu Ray
    Description

    🎟️ BookMyShow SQL Data Analysis

    🎯 Objective

    This project leverages SQL-based analysis to gain actionable insights into user engagement, movie performance, theater efficiency, payment systems, and customer satisfaction on the BookMyShow platform. The goal is to enhance platform performance, boost revenue, and optimize user experience through data-driven strategies.

    📊 Key Analysis Areas

    1. 👥 User Behavior & Engagement: identify the most active users and repeat customers, track unique monthly users, analyze peak booking times and average tickets per user, and drive engagement and retention strategies.
    2. 🎬 Movie Performance Analysis: highlight top-rated and most booked movies, analyze popular languages and high-revenue genres, study average occupancy rates, and focus marketing on high-performing genres and content.
    3. 🏢 Theater & Show Performance: pinpoint theaters with the highest/lowest bookings, evaluate popular show timings, measure theater-wise revenue contribution and occupancy, and improve theater scheduling and resource allocation.
    4. 💵 Booking & Revenue Insights: track total revenue, top spenders, and monthly booking patterns; discover the most used payment methods; calculate average price per booking and bookings per user; optimize revenue generation and spending strategies.
    5. 🪑 Seat Utilization & Pricing Strategy: identify the most booked seat types and their revenue impact, analyze seat pricing variations and price elasticity, and align pricing strategy with demand patterns for higher revenue.
    6. ✅❌ Payment & Transaction Analysis: distinguish successful vs. failed transactions, track refund frequency and payment delays, evaluate revenue lost due to failures, and enhance payment processing systems.
    7. ⭐ User Reviews & Sentiment Analysis: measure average ratings per movie, identify top- and lowest-rated content, analyze review volume and sentiment trends, and leverage feedback to refine content offerings.

    🧰 Tech Stack

    • Query Language: SQL (MySQL/PostgreSQL)
    • Database Tools: DBeaver, pgAdmin, or any SQL IDE
    • Visualization (optional): Power BI / Tableau for presenting insights
    • Version Control: Git & GitHub
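
    As a sketch of one of the engagement questions above (unique monthly users), the query below runs against an illustrative bookings table in SQLite; the actual project targets MySQL/PostgreSQL, where strftime would be replaced by DATE_FORMAT or date_trunc.

```python
# Unique users per month from a small illustrative bookings table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bookings (booking_id INTEGER PRIMARY KEY, user_id INTEGER, booking_date TEXT);
INSERT INTO bookings (user_id, booking_date) VALUES
    (1, '2025-01-05'), (2, '2025-01-17'), (1, '2025-01-30'),
    (3, '2025-02-02'), (1, '2025-02-14');
""")

rows = conn.execute("""
SELECT strftime('%Y-%m', booking_date) AS month,
       COUNT(DISTINCT user_id)         AS unique_users
FROM bookings
GROUP BY month
ORDER BY month;
""").fetchall()
print(rows)  # [('2025-01', 2), ('2025-02', 2)]
conn.close()
```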

  8. Data from: Vendor Performance Analysis

    • kaggle.com
    Updated Sep 6, 2025
    Cite
    Harsh Madhavan (2025). Vendor Performance Analysis [Dataset]. https://www.kaggle.com/datasets/harshmadhavan/vendor-performance-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    Kaggle
    Authors
    Harsh Madhavan
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    📖 Dataset Description

    This dataset provides an end-to-end view of vendor performance across multiple dimensions — purchases, sales, inventory, pricing, and invoices. It is designed for data analytics, visualization, and business intelligence projects, making it ideal for learners and professionals exploring procurement, vendor management, and supply chain optimization.

    🔗 GitHub Project (Code + Power BI Dashboard): Vendor Performance Analysis (https://github.com/HARSH-MADHAVAN/Vendor-Performance-Analysis)

    The dataset includes:

    • purchases.csv → Detailed vendor purchase transactions
    • sales.csv → Sales performance data linked to vendors
    • inventory.csv (begin & end) → Stock levels at different periods
    • purchase_prices.csv → Historical vendor pricing
    • vendor_invoice.csv → Invoice details for reconciliation
    • vendor_sales_summary.csv → Aggregated vendor-wise sales insights

    Use this dataset to practice:

    • SQL querying & data modeling
    • Python analytics & preprocessing
    • Power BI dashboarding & reporting
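
    A small pandas sketch of the vendor-level rollup this dataset is meant for. The column names ("VendorNumber", "PurchaseDollars", "SalesDollars") are assumptions; check the CSV headers and adjust before running.

```python
# Join purchase and sales totals per vendor and compute a simple margin.
# Column names are assumptions; adapt them to the actual CSV headers.
import pandas as pd

purchases = pd.read_csv("purchases.csv")
sales = pd.read_csv("sales.csv")

purchase_totals = purchases.groupby("VendorNumber")["PurchaseDollars"].sum()
sales_totals = sales.groupby("VendorNumber")["SalesDollars"].sum()

vendor_perf = pd.concat([purchase_totals, sales_totals], axis=1).fillna(0)
vendor_perf["GrossMargin"] = vendor_perf["SalesDollars"] - vendor_perf["PurchaseDollars"]
print(vendor_perf.sort_values("GrossMargin", ascending=False).head())
```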

  9. BigQuery Fintech Dataset

    • kaggle.com
    Updated Aug 17, 2024
    Cite
    Mustafa Keser (2024). BigQuery Fintech Dataset [Dataset]. https://www.kaggle.com/datasets/mustafakeser4/bigquery-fintech-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 17, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mustafa Keser
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset: cloud-training-demos.fintech

    This dataset, hosted on BigQuery, is designed for financial technology (fintech) training and analysis. It comprises six interconnected tables, each providing detailed insights into various aspects of customer loans, loan purposes, and regional distributions. The dataset is ideal for practicing SQL queries, building data models, and conducting financial analytics.

    Tables:

    1. customer:
      Contains records of individual customers, including demographic details and unique customer IDs. This table serves as a primary reference for analyzing customer behavior and loan distribution.

    2. loan:
      Includes detailed information about each loan issued, such as the loan amount, interest rate, and tenure. The table is crucial for analyzing lending patterns and financial outcomes.

    3. loan_count_by_year:
      Provides aggregated loan data by year, offering insights into yearly lending trends. This table helps in understanding the temporal dynamics of loan issuance.

    4. loan_purposes:
      Lists various reasons or purposes for which loans were issued, along with corresponding loan counts. This data can be used to analyze customer needs and market demands.

    5. loan_with_region:
      Combines loan data with regional information, allowing for geographical analysis of lending activities. This table is key for regional market analysis and understanding how loan distribution varies across different areas.

    6. state_region:
      Maps state names to their respective regions, enabling a more granular geographical analysis when combined with other tables in the dataset.

    Use Cases:

    • Customer Segmentation: Analyze customer data to identify distinct segments based on demographics and loan behaviors.
    • Loan Analysis: Explore loan issuance patterns, interest rates, and purposes to uncover trends and insights.
    • Regional Analysis: Combine loan and region data to understand how loan distributions vary by geography.
    • Temporal Trends: Utilize the loan_count_by_year table to observe how lending patterns evolve over time.

    This dataset is ideal for those looking to enhance their skills in SQL, financial data analysis, and BigQuery, providing a comprehensive foundation for fintech-related projects and case studies.
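
    A sketch of a regional-analysis query against the BigQuery dataset named above, using the google-cloud-bigquery client. The column names (region, loan_amount) are assumptions; inspect the table schemas in the BigQuery console first. Authenticated Google Cloud credentials are required.

```python
# Count loans and average loan amount per region from loan_with_region.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT region, COUNT(*) AS loans, AVG(loan_amount) AS avg_amount
FROM `cloud-training-demos.fintech.loan_with_region`
GROUP BY region
ORDER BY loans DESC
"""
for row in client.query(query).result():
    print(row.region, row.loans, row.avg_amount)
```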

  10. Wikipedia SQLITE Portable DB, Huge 5M+ Rows

    • kaggle.com
    zip
    Updated Jun 29, 2024
    Cite
    christernyc (2024). Wikipedia SQLITE Portable DB, Huge 5M+ Rows [Dataset]. https://www.kaggle.com/datasets/christernyc/wikipedia-sqlite-portable-db-huge-5m-rows/code
    Explore at:
    Available download formats: zip (6064169983 bytes)
    Dataset updated
    Jun 29, 2024
    Authors
    christernyc
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.

    I am not affiliated or partnered with Kensho in any way; I just really like this dataset because it gives my agents something easy to query.

    Key Features:

    • Contains over 5 million rows of data from English Wikipedia and Wikidata
    • Stored in a portable SQLite database format for easy integration and querying
    • Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
    • Ideal for NLP tasks, machine learning, data analysis, and research projects

    The database consists of four main tables:

    • items: Contains information about Wikipedia items, including labels and descriptions
    • properties: Stores details about Wikidata properties, such as labels and descriptions
    • pages: Provides metadata for Wikipedia pages, including page IDs, item IDs, titles, and view counts
    • link_annotated_text: Contains the link-annotated text of Wikipedia pages, divided into sections

    This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license.

    Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes.

    By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.

    https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data

    Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings

    Usage with LIKE queries:

```python
import aiosqlite
import asyncio

class KenshoDatasetQuery:
    """Async context manager wrapping LIKE searches over the SQLite database."""

    def __init__(self, db_file):
        self.db_file = db_file

    async def __aenter__(self):
        self.conn = await aiosqlite.connect(self.db_file)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.conn.close()

    async def search_pages_by_title(self, title):
        # Join pages with their Wikidata item and link-annotated text.
        query = """
        SELECT pages.page_id, pages.item_id, pages.title, pages.views,
               items.labels AS item_labels, items.description AS item_description,
               link_annotated_text.sections
        FROM pages
        JOIN items ON pages.item_id = items.id
        JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
        WHERE pages.title LIKE ?
        """
        async with self.conn.execute(query, (f"%{title}%",)) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label_or_description(self, keyword):
        query = """
        SELECT id, labels, description
        FROM items
        WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label(self, label):
        query = """
        SELECT id, labels, description
        FROM items
        WHERE labels LIKE ?
        """
        async with self.conn.execute(query, (f"%{label}%",)) as cursor:
            return await cursor.fetchall()

    # async def search_properties_by_label_or_desc...
```
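
    A usage sketch for the helper class above; the database file name is a placeholder for wherever the downloaded SQLite file is stored.

```python
# Open the database, run a title search, and print a few matching pages.
import asyncio

async def main():
    async with KenshoDatasetQuery("kensho_wikipedia.sqlite") as db:  # placeholder path
        pages = await db.search_pages_by_title("Alan Turing")
        for page_id, item_id, title, views, labels, description, sections in pages[:3]:
            print(page_id, title, views)

asyncio.run(main())
```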
    
  11. Greenspot Grocer SQL Project

    • kaggle.com
    Updated Jun 3, 2025
    Cite
    Wasinata ndzakawa (2025). Greenspot Grocer SQL Project [Dataset]. https://www.kaggle.com/datasets/wasinatandzakawa/greenspot-grocer-sql-project
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 3, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wasinata ndzakawa
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This project was a powerful introduction to the practical application of database design and SQL in a real-world scenario. It helped me understand how a well-structured relational database supports business scalability and data integrity — especially for businesses transitioning from flat files like spreadsheets to a more robust system.

    One key takeaway for me was the importance of normalizing data, not just to reduce redundancy but to ensure that information is easily queryable and future-proof. Working with MySQL Workbench also gave me hands-on experience in visual database modeling, which made the conceptual relationships between tables much clearer.

    While I encountered a few challenges setting up MySQL Workbench and configuring the database connections, overcoming those technical steps gave me more confidence in managing development tools — a crucial skill for both data analysts and back-end developers.

    If I were to extend this project in the future, I would consider:

    Adding tables for inventory management, supplier information, or delivery tracking

    Building simple data dashboards to visualize sales and product performance

    Automating the data import process from CSV to SQL

    Overall, this project bridged the gap between theory and practical application. It deepened my understanding of how structured data can unlock powerful insights and better decision-making for businesses.

  12. Mastering the Essentials:Hands-On DDL Command Prac

    • kaggle.com
    zip
    Updated Sep 25, 2023
    Cite
    Radha Gandhi (2023). Mastering the Essentials:Hands-On DDL Command Prac [Dataset]. https://www.kaggle.com/datasets/radhagandhi/1practical-exercise-in-ddl-commands/code
    Explore at:
    Available download formats: zip (7378 bytes)
    Dataset updated
    Sep 25, 2023
    Authors
    Radha Gandhi
    Description

    The Practical Exercise in SQL Data Definition Language (DDL) Commands is a hands-on project designed to help you gain a deep understanding of fundamental DDL commands in SQL, including:

    • CREATE TABLE
    • ALTER TABLE (ADD, RENAME, DROP)
    • TRUNCATE TABLE

    This project aims to enhance your proficiency in using SQL to create, modify, and manage database structures effectively.

    1.1 DDL-CREATE TABLE

    1.2 DDL-ALTER TABLE(ADD)

    1.3 DDL-ALTER(RENAME COLUMN NAME)

    1.4 DDL-ALTER(RENAME TABLE NAME)

    1.5 DDL-ALTER(DROP COLUMN FROM TABLE)

    1.6 DDL-ALTER(DROP TABLE)

    1.7 DDL- TRUNCATE TABLE
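
    A quick runnable tour of the DDL commands listed above, using SQLite from Python. SQLite has no TRUNCATE statement, so the equivalent unqualified DELETE is shown; ALTER TABLE ... RENAME COLUMN needs SQLite 3.25+ and DROP COLUMN needs 3.35+. Table and column names are illustrative.

```python
# CREATE, ALTER (ADD / RENAME / DROP), TRUNCATE-equivalent, and DROP TABLE.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, fullname TEXT)")   # 1.1 CREATE TABLE
cur.execute("ALTER TABLE staff ADD COLUMN hire_date TEXT")                  # 1.2 ALTER ... ADD
cur.execute("ALTER TABLE staff RENAME COLUMN fullname TO full_name")        # 1.3 ALTER ... RENAME COLUMN
cur.execute("ALTER TABLE staff RENAME TO employees")                        # 1.4 ALTER ... RENAME TABLE
cur.execute("ALTER TABLE employees DROP COLUMN hire_date")                  # 1.5 ALTER ... DROP COLUMN
cur.execute("DELETE FROM employees")                                        # 1.7 TRUNCATE equivalent in SQLite
cur.execute("DROP TABLE employees")                                         # 1.6 drop the table

conn.close()
```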

  13. Amazon India Sales 2025 Analysis

    • kaggle.com
    zip
    Updated Nov 8, 2025
    Cite
    Allen Close (2025). Amazon India Sales 2025 Analysis [Dataset]. https://www.kaggle.com/datasets/allenclose/amazon-india-sales-2025-analysis
    Explore at:
    Available download formats: zip (3793 bytes)
    Dataset updated
    Nov 8, 2025
    Authors
    Allen Close
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Area covered
    India
    Description

    Comprehensive Amazon India sales dataset featuring 15,000 synthetic e-commerce transactions from 2025. This cleaned and validated dataset captures real-world shopping patterns including customer behavior, product preferences, payment methods, delivery metrics, and regional sales distribution across Indian states.

    Key Features:
    - 15,000 orders across multiple product categories (Electronics, Clothing, Home & Kitchen, Beauty)
    - Daily OHLCV-style transactional data from January to December 2025
    - Complete customer journey: order placement, payment, delivery, and review
    - Geographic coverage across major Indian states
    - Payment method diversity: Credit Card, Debit Card, UPI, Cash on Delivery
    - Delivery status tracking: Delivered, Pending, Returned
    - Customer review ratings and sentiment analysis

    Dataset Columns (14): Order_ID, Date, Customer_ID, Product_Category, Product_Name, Quantity, Unit_Price_INR, Total_Sales_INR, Payment_Method, Delivery_Status, Review_Rating, Review_Text, State, Country

    Use Cases:
    - E-commerce sales analysis and forecasting
    - Customer behavior and segmentation studies
    - Payment method preference analysis
    - Regional market trends and geographic insights
    - Delivery optimization and logistics planning
    - Product performance and category analysis
    - Customer satisfaction and review analysis
    - SQL practice and business intelligence training

    Data Quality:
    - Cleaned and validated for analysis
    - No missing values in critical fields
    - Consistent data types and formatting
    - Ready for immediate SQL/Python analysis

    Perfect for data analysts, SQL learners, business intelligence projects, and e-commerce analytics practice!
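
    A minimal sketch using the columns listed above; only the CSV file name is a placeholder for whatever the download is called.

```python
# Total sales per state and overall return rate.
import pandas as pd

df = pd.read_csv("amazon_india_sales_2025.csv")  # placeholder file name

by_state = (df.groupby("State")["Total_Sales_INR"]
              .sum()
              .sort_values(ascending=False))
print(by_state.head(10))

return_rate = (df["Delivery_Status"] == "Returned").mean()
print(f"Return rate: {return_rate:.1%}")
```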

  14. E-commerce_dataset

    • kaggle.com
    zip
    Updated Nov 16, 2025
    Cite
    Abhay Ayare (2025). E-commerce_dataset [Dataset]. https://www.kaggle.com/datasets/abhayayare/e-commerce-dataset
    Explore at:
    Available download formats: zip (644123 bytes)
    Dataset updated
    Nov 16, 2025
    Authors
    Abhay Ayare
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    E-commerce_dataset

    This dataset is a synthetic yet realistic E-commerce retail dataset generated programmatically using Python (Faker + NumPy + Pandas).
    It is designed to closely mimic real-world online shopping behavior, user patterns, product interactions, seasonal trends, and marketplace events.
    
    

    You can use this dataset for:

    Machine Learning & Deep Learning
    Recommender Systems
    Customer Segmentation
    Sales Forecasting
    A/B Testing
    E-commerce Behaviour Analysis
    Data Cleaning / Feature Engineering Practice
    SQL practice
    

    📁 Dataset Contents

    The dataset contains 6 CSV files:

    | File | Rows | Description |
    |------|------|-------------|
    | users.csv | ~10,000 | User profiles, demographics & signup info |
    | products.csv | ~2,000 | Product catalog with rating and pricing |
    | orders.csv | ~20,000 | Order-level transactions |
    | order_items.csv | ~60,000 | Items purchased per order |
    | reviews.csv | ~15,000 | Customer-written product reviews |
    | events.csv | ~80,000 | User event logs: view, cart, wishlist, purchase |

    🧬 Data Dictionary

    1. Users (users.csv)

    | Column | Description |
    |--------|-------------|
    | user_id | Unique user identifier |
    | name | Full customer name |
    | email | Email (synthetic, no real emails) |
    | gender | Male / Female / Other |
    | city | City of residence |
    | signup_date | Account creation date |

    2. Products (products.csv)

    | Column | Description |
    |--------|-------------|
    | product_id | Unique product identifier |
    | product_name | Product title |
    | category | Electronics, Clothing, Beauty, Home, Sports, etc. |
    | price | Actual selling price |
    | rating | Average product rating |

    3. Orders (orders.csv)

    | Column | Description |
    |--------|-------------|
    | order_id | Unique order identifier |
    | user_id | User who placed the order |
    | order_date | Timestamp of the order |
    | order_status | Completed / Cancelled / Returned |
    | total_amount | Total order value |

    4. Order Items (order_items.csv)

    | Column | Description |
    |--------|-------------|
    | order_item_id | Unique identifier |
    | order_id | Associated order |
    | product_id | Purchased product |
    | quantity | Quantity purchased |
    | item_price | Price per unit |

    5. Reviews (reviews.csv)

    | Column | Description |
    |--------|-------------|
    | review_id | Unique review identifier |
    | user_id | User who submitted the review |
    | product_id | Reviewed product |
    | rating | 1–5 star rating |
    | review_text | Short synthetic review |
    | review_date | Submission date |

    6. Events (events.csv)

    | Column | Description |
    |--------|-------------|
    | event_id | Unique event identifier |
    | user_id | User performing the event |
    | product_id | Viewed/added/purchased product |
    | event_type | view / cart / wishlist / purchase |
    | event_timestamp | Timestamp of the event |


    🧠 Possible Use Cases (Ideas & Projects)

    🔍 Machine Learning

    Customer churn prediction
    Review sentiment analysis (NLP)
    Recommendation engines
    Price optimization models
    Demand forecasting (Time-series)
    

    📦 Business Analytics

    Market basket analysis
    RFM segmentation
    Cohort analysis
    Funnel conversion tracking
    A/B testing simulations
    

    🧮 SQL Practice

    Joins
    Window functions
    Aggregations
    CTE-based funnels
    Complex queries
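
    As a sketch of the CTE-based funnel idea above, the query below counts users reaching each stage (view, cart, purchase) of the events table, using the event_type values from the data dictionary and a few in-memory rows so it runs as-is.

```python
# CTE-based conversion funnel over the events table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id INTEGER, user_id INTEGER, product_id INTEGER,
                     event_type TEXT, event_timestamp TEXT);
INSERT INTO events (user_id, product_id, event_type) VALUES
    (1, 7, 'view'), (1, 7, 'cart'), (1, 7, 'purchase'),
    (2, 7, 'view'), (2, 7, 'cart'),
    (3, 9, 'view');
""")

funnel = conn.execute("""
WITH viewers AS (SELECT DISTINCT user_id FROM events WHERE event_type = 'view'),
     carters AS (SELECT DISTINCT user_id FROM events WHERE event_type = 'cart'),
     buyers  AS (SELECT DISTINCT user_id FROM events WHERE event_type = 'purchase')
SELECT (SELECT COUNT(*) FROM viewers) AS viewed,
       (SELECT COUNT(*) FROM carters) AS added_to_cart,
       (SELECT COUNT(*) FROM buyers)  AS purchased;
""").fetchone()
print(funnel)  # (3, 2, 1)
conn.close()
```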
    

    🛠 How the Dataset Was Generated

    The dataset was generated entirely in Python using:

    Faker for realistic user and review generation
    NumPy for probability-based event modeling
    Pandas for data processing
    

    Custom logic for:

    demand variation
    user behavior simulation
    return/cancel probabilities
    seasonal order timestamp distribution

    The dataset does not include any real personal data. Everything is generated synthetically.
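
    A rough sketch of the generation approach described above (Faker for profiles, NumPy for probability-based events, pandas for assembly). This is not the author's actual script, just an illustration of the technique.

```python
# Generate a small synthetic users table and a probability-weighted event log.
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
rng = np.random.default_rng(42)

users = pd.DataFrame({
    "user_id": range(1, 101),
    "name": [fake.name() for _ in range(100)],
    "city": [fake.city() for _ in range(100)],
    "signup_date": [fake.date_between("-2y", "today") for _ in range(100)],
})

# Probability-based event modelling: event types drawn from a skewed distribution.
event_types = rng.choice(["view", "cart", "wishlist", "purchase"],
                         size=500, p=[0.6, 0.2, 0.1, 0.1])
events = pd.DataFrame({
    "user_id": rng.integers(1, 101, size=500),
    "event_type": event_types,
})
print(users.head(), events["event_type"].value_counts(), sep="\n")
```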
    

    ⚠️ License

    This dataset is released under CC BY 4.0 — free to use for:
    Research
    Education
    Commercial projects
    Kaggle competitions
    Machine learning pipelines
    Just provide attribution.
    

    ⭐ If you found this dataset helpful, please:

    Upvote the dataset
    Leave a comment
    Share your notebooks using it
    
  15. Healthcare Fraud Detection Dataset

    • kaggle.com
    zip
    Updated Mar 6, 2025
    Cite
    Vishal Jaiswal (2025). Healthcare Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/jaiswalmagic1/healthcare-fraud-detection-dataset
    Explore at:
    Available download formats: zip (10427537 bytes)
    Dataset updated
    Mar 6, 2025
    Authors
    Vishal Jaiswal
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset contains comprehensive synthetic healthcare data designed for fraud detection analysis. It includes information on patients, healthcare providers, insurance claims, and payments. The dataset is structured to mimic real-world healthcare transactions, where fraudulent activities such as false claims, overbilling, and duplicate charges can be identified through advanced analytics.

    The dataset is suitable for practicing SQL queries, exploratory data analysis (EDA), machine learning for fraud detection, and visualization techniques. It is designed to help data analysts and data scientists develop and refine their analytical skills in the healthcare insurance domain.

    Dataset Overview

    The dataset consists of four CSV files:

    Patients Data (patients.csv)

    Contains demographic details of patients, such as age, gender, insurance type, and location. Can be used to analyze patient demographics and healthcare usage patterns.

    Providers Data (providers.csv)

    Contains information about healthcare providers, including provider ID, specialty, location, and associated hospital. Useful for identifying fraudulent claims linked to specific providers or hospitals.

    Claims Data (claims.csv)

    Contains records of insurance claims made by patients, including diagnosis codes, treatment details, provider ID, and claim amount. Can be analyzed for suspicious patterns, such as excessive claims from a single provider or duplicate claims for the same patient.

    Payments Data (payments.csv)

    Contains details of claim payments made by insurance companies, including payment amount, claim ID, and reimbursement status. Helps in detecting discrepancies between claims and actual reimbursements.

    Possible Analysis Ideas

    This dataset allows for multiple analysis approaches, including but not limited to:

    🔹 Fraud Detection: Identify patterns in claims data to detect fraudulent activities (e.g., excessive billing, duplicate claims).
    🔹 Provider Behavior Analysis: Analyze providers who have an unusually high claim volume or high rejection rates.
    🔹 Payment Trends: Compare claims vs. payments to find irregularities in reimbursement patterns.
    🔹 Patient Demographics & Utilization: Explore which patient groups are more likely to file claims and receive reimbursements.
    🔹 SQL Query Practice: Perform advanced SQL queries, including joins, aggregations, window functions, and subqueries, to extract insights from the data.

    Use Cases

    • Practicing SQL queries for job interviews and real-world projects.
    • Learning data cleaning, data wrangling, and feature engineering for healthcare analytics.
    • Applying machine learning techniques for fraud detection.
    • Gaining insights into the healthcare insurance domain and its challenges.
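
    A sketch of the duplicate-claim check mentioned above. The column names (patient_id, provider_id, claim_amount, claim_date) are assumptions; map them onto the actual headers of claims.csv. Loading the CSV into SQLite keeps the check in plain SQL.

```python
# Flag groups of identical claims (same patient, provider, amount, and date).
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
pd.read_csv("claims.csv").to_sql("claims", conn, index=False)

dupes = conn.execute("""
SELECT patient_id, provider_id, claim_amount, claim_date, COUNT(*) AS n
FROM claims
GROUP BY patient_id, provider_id, claim_amount, claim_date
HAVING COUNT(*) > 1
ORDER BY n DESC;
""").fetchall()
print(dupes[:10])
```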

    License & Usage

    License: CC0 Public Domain (free to use for any purpose).
    Attribution: Not required but appreciated.
    Intended Use: This dataset is for educational and research purposes only.

    This dataset is an excellent resource for aspiring data analysts, data scientists, and SQL learners who want to gain hands-on experience in healthcare fraud detection.

  16. Waddle Portfolio

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Colin Waddle (2025). Waddle Portfolio [Dataset]. https://www.kaggle.com/datasets/colindwaddle/waddle-portfolio
    Explore at:
    Available download formats: zip (4330358 bytes)
    Dataset updated
    Jul 31, 2025
    Authors
    Colin Waddle
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This portfolio highlights both practice projects and contributions I've made on the job, with a focus on practical, results-driven analysis. Each project reflects my ability to solve business problems using tools like Excel for data visualization, SQL for querying and structuring data, and the skills I've built in Python.

  17. Superstore Sales EDA - Nawaf Alzzeer

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Nawaf Alzeer (2025). Superstore Sales EDA - Nawaf Alzzeer [Dataset]. https://www.kaggle.com/datasets/nawafalzeer/superstore-sales-eda-nawaf-alzzeer
    Explore at:
    Available download formats: zip (809072 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Nawaf Alzeer
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Complete data engineering project on 4 years (2014-2017) of retail sales transactions.

    DATASET CONTENTS:
    - Original denormalized data (9,994 rows)
    - Normalized database: 4 tables (customers, orders, products, sales)
    - 9 SQL analysis files organized by phase
    - Complete EDA from data cleaning to business insights

    DATABASE TABLES:
    - customers: 793 records
    - orders: 4,931 records
    - products: 1,812 records
    - sales: 9,686 transactions

    KEY FINDINGS:
    - Low profitability: 12.44% margin (below industry standard)
    - Discount problem: 50%+ of transactions have 20%+ discounts
    - Loss-making: 18.66% of transactions lose money
    - Furniture crisis: only 2.31% margin
    - Small baskets: only 1.96 items per order

    SQL SKILLS DEMONSTRATED:
    ✓ Window functions (ROW_NUMBER, PARTITION BY)
    ✓ Database normalization (3NF)
    ✓ Complex JOINs (3-4 tables)
    ✓ Data deduplication with CTEs
    ✓ Business analytics queries
    ✓ CASE statements and aggregations
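
    A runnable illustration of the deduplication pattern listed above (ROW_NUMBER over a CTE), on a throwaway in-memory table rather than the actual superstore schema.

```python
# Keep only the first row of each duplicate group using ROW_NUMBER in a CTE.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_orders (order_id TEXT, customer_id TEXT, sales REAL);
INSERT INTO raw_orders VALUES
    ('CA-1001', 'C1', 120.0),
    ('CA-1001', 'C1', 120.0),   -- duplicate row
    ('CA-1002', 'C2', 80.5);
""")

deduped = conn.execute("""
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id, customer_id, sales
                              ORDER BY order_id) AS rn
    FROM raw_orders
)
SELECT order_id, customer_id, sales FROM ranked WHERE rn = 1 ORDER BY order_id;
""").fetchall()
print(deduped)  # [('CA-1001', 'C1', 120.0), ('CA-1002', 'C2', 80.5)]
conn.close()
```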

    PERFECT FOR:
    - SQL practice (beginner to advanced)
    - Database normalization learning
    - EDA methodology study
    - Business analytics projects
    - Data engineering portfolios

    FILES INCLUDED:
    - 5 CSV files (original + 4 normalized tables)
    - 9 SQL query files (cleaning, migration, analysis)

    Author: Nawaf Alzzeer License: CC BY-SA 4.0

  18. Retail Sales, Returns & Shipping Dataset

    • kaggle.com
    zip
    Updated Aug 15, 2025
    Cite
    kunal malviya (2025). Retail Sales, Returns & Shipping Dataset [Dataset]. https://www.kaggle.com/datasets/kunalmalviya06/retail-sales-returns-and-shipping-dataset
    Explore at:
    Available download formats: zip (632399 bytes)
    Dataset updated
    Aug 15, 2025
    Authors
    kunal malviya
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset provides a comprehensive view of retail operations, combining sales transactions, return records, and shipping cost details into one analysis-ready package. It’s ideal for data analysts, business intelligence professionals, and students looking to practice Power BI, Tableau, or SQL projects focusing on sales performance, profitability, and operational cost analysis.

    Dataset Structure

    Orders Table – Detailed transactional data

    Row ID

    Order ID

    Order Date, Ship Date, Delivery Duration

    Ship Mode

    Customer ID, Customer Name, Segment, Country, City, State, Postal Code, Region

    Product ID, Category, Sub-Category, Product Name

    Sales, Quantity, Discount, Discount Value, Profit, COGS

    Returns Table – Return records by Order ID

    Returned (Yes/No)

    Order ID

    Shipping Cost Table – State-level shipping expenses

    State

    Shipping Cost Per Unit

    Potential Use Cases

    Calculate gross vs. net profit after considering returns and shipping costs.

    Perform regional sales and profit analysis.

    Identify high-return products and loss-making categories.

    Visualize KPIs in Power BI or Tableau.

    Build predictive models for returns or shipping costs.
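
    A pandas sketch of one possible gross-vs-net profit calculation from the tables above. The column names follow the structure listed in the description; the file names and the exact net-profit definition (excluding returned orders and subtracting shipping cost) are assumptions to adjust to your own analysis.

```python
# Combine orders, returns, and state-level shipping costs, then compare
# gross profit with a net figure that removes returned orders and shipping.
import pandas as pd

orders = pd.read_csv("orders.csv")            # placeholder file names
returns = pd.read_csv("returns.csv")
shipping = pd.read_csv("shipping_cost.csv")

df = (orders
      .merge(returns, on="Order ID", how="left")
      .merge(shipping, on="State", how="left"))

df["Returned"] = df["Returned"].fillna("No")
df["Shipping Cost"] = df["Quantity"] * df["Shipping Cost Per Unit"]

gross_profit = df["Profit"].sum()
net_profit = (df.loc[df["Returned"] == "No", "Profit"].sum()
              - df["Shipping Cost"].sum())
print(f"Gross profit: {gross_profit:,.2f}  Net profit: {net_profit:,.2f}")
```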

    Source & Context: The dataset is designed for educational and analytical purposes. It is inspired by retail and e-commerce operations data and was prepared for data analytics portfolio projects.

    License: Open for use in learning, analytics projects, and data visualization practice.

  19. cyclistic-bike-share-2022-2024-clean

    • kaggle.com
    zip
    Updated Nov 28, 2025
    Cite
    Chathuranga Sudusinghe (2025). cyclistic-bike-share-2022-2024-clean [Dataset]. https://www.kaggle.com/datasets/indrajithsudusinghe/cyclistic-bike-share-2022-2024-clean
    Explore at:
    Available download formats: zip (579891587 bytes)
    Dataset updated
    Nov 28, 2025
    Authors
    Chathuranga Sudusinghe
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Cyclistic Bike-Share Dataset (2022–2024) – Cleaned & Merged

    This dataset contains three full years (2022, 2023, and 2024) of publicly available Cyclistic bike-share trip data. All yearly files have been cleaned, standardized, and merged into a single high-quality master dataset for easy analysis.

    The dataset is ideal for:

    • Data Analysis & Visualization
    • SQL Projects
    • Python (Pandas) Practice
    • Power BI, Tableau Dashboards
    • Machine Learning Feature Engineering

    🔹 Key Cleaning & Processing Steps
    - Removed duplicate records
    - Handled missing values
    - Standardized column names
    - Converted date-time formats
    - Created calculated columns (ride length, day, month, etc.)
    - Merged yearly datasets into one master CSV file (3.17 GB)
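
    A pandas sketch of the calculated-column step above. The column names (started_at, ended_at, member_casual) are assumptions based on the public Cyclistic/Divvy schema, and the file name is a placeholder; adjust both to match the merged CSV.

```python
# Derive ride length, day of week, and month, then compare rider types.
import pandas as pd

df = pd.read_csv("cyclistic_2022_2024_clean.csv",      # placeholder file name
                 parse_dates=["started_at", "ended_at"])

df["ride_length_min"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60
df["day_of_week"] = df["started_at"].dt.day_name()
df["month"] = df["started_at"].dt.to_period("M")

print(df.groupby("member_casual")["ride_length_min"].mean())
```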

    🔹 What You Can Analyze
    - Member vs Casual rider behavior
    - Peak riding hours and days
    - Monthly & seasonal trends
    - Trip duration patterns
    - Station usage & demand forecasting

    This dataset is especially useful for data analyst portfolio projects and technical interview preparation.

  20. Insurance Dataset for Data Engineering Practice

    • kaggle.com
    zip
    Updated Sep 24, 2025
    Cite
    KPOVIESI Olaolouwa Amiche Stéphane (2025). Insurance Dataset for Data Engineering Practice [Dataset]. https://www.kaggle.com/datasets/kpoviesistphane/insurance-dataset-for-data-engineering-practice
    Explore at:
    Available download formats: zip (475362 bytes)
    Dataset updated
    Sep 24, 2025
    Authors
    KPOVIESI Olaolouwa Amiche Stéphane
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Insurance Dataset for Data Engineering Practice

    Overview

    A realistic synthetic French insurance dataset specifically designed for practicing data cleaning, transformation, and analytics with PySpark and other big data tools. This dataset contains intentional data quality issues commonly found in real-world insurance data.

    Dataset Contents

    📊 Three Main Tables:

    • contracts.csv (~15,000 rows) - Insurance contracts with client information
    • claims.csv (~6,000 rows) - Insurance claims with damage and settlement details
    • vehicles.csv (~12,000 rows) - Vehicle information for auto insurance contracts

    🗺️ Geographic Coverage:

    • French cities with realistic postal codes
    • Risk zone classifications (High/Medium/Low)
    • Regional pricing coefficients

    🏷️ Product Types:

    • Auto Insurance (majority)
    • Home Insurance
    • Life Insurance
    • Health Insurance

    🎯 Intentional Data Quality Issues

    Perfect for practicing data cleaning and transformation:

    Date Format Issues:

    • Mixed formats: 2024-01-15, 15/01/2024, 01/15/2024
    • String storage requiring parsing and standardization

    Price Format Inconsistencies:

    • Multiple currency formats: 1250.50€, €1250.50, 1250.50 EUR, $1375.55
    • Missing currency symbols: 1250.50
    • Written formats: 1250.50 euros

    Missing Data Patterns:

    • Strategic missingness in age (8%), CSP (12%), expert_id (20-25%)
    • Realistic patterns based on business logic

    Categorical Inconsistencies:

    • Gender: M, F, Male, Female, empty strings
    • Power units: 150 HP, 150hp, 150 CV, 111 kW, missing values

    Data Type Issues:

    • Numeric values stored as strings
    • Mixed data types requiring casting

    🚀 Perfect for Practicing:

    PySpark Operations:

    • to_date() and date parsing functions
    • regexp_replace() for price cleaning
    • when().otherwise() conditional logic
    • cast() for data type conversions
    • fillna() and dropna() strategies
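
    A short PySpark sketch of the cleaning operations listed above, applied to the contracts table. The column names (subscription_date, premium, gender) are assumptions; swap in the real headers from contracts.csv.

```python
# Parse mixed date formats, clean price strings, normalize labels, fill gaps.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("insurance-cleaning").getOrCreate()
df = spark.read.csv("contracts.csv", header=True)

df_clean = (
    df
    # Parse the mixed date formats into a single DATE column.
    .withColumn("subscription_date",
                F.coalesce(F.to_date("subscription_date", "yyyy-MM-dd"),
                           F.to_date("subscription_date", "dd/MM/yyyy"),
                           F.to_date("subscription_date", "MM/dd/yyyy")))
    # Strip currency symbols/words from prices, then cast to double.
    .withColumn("premium", F.regexp_replace("premium", r"[^0-9.,]", ""))
    .withColumn("premium", F.regexp_replace("premium", ",", ".").cast("double"))
    # Normalize the gender labels.
    .withColumn("gender",
                F.when(F.col("gender").isin("M", "Male"), "M")
                 .when(F.col("gender").isin("F", "Female"), "F")
                 .otherwise(None))
    # Simple missing-value strategy for the remaining gaps.
    .fillna({"gender": "Unknown"})
)
df_clean.show(5)
```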

    Data Engineering Tasks:

    • ETL pipeline development
    • Data validation and quality checks
    • Join operations across related tables
    • Aggregation with business logic
    • Data standardization workflows

    Analytics & ML:

    • Customer segmentation
    • Claim frequency analysis
    • Premium pricing models
    • Risk assessment by geography
    • Churn prediction

    🏢 Business Context

    Realistic insurance business rules implemented:
    - Age-based premium adjustments
    - Geographic risk zone pricing
    - Product-specific claim patterns
    - Seasonal claim distributions
    - Client lifecycle status transitions

    💡 Use Cases:

    • Data Engineering Bootcamps: Hands-on PySpark practice
    • SQL Training: Complex joins and aggregations
    • Data Science Projects: End-to-end ML pipeline development
    • Business Intelligence: Dashboard and reporting practice
    • Data Quality Workshops: Cleaning and validation techniques

    🔧 Tools Compatibility:

    • Apache Spark / PySpark
    • Pandas / Python
    • SQL databases
    • Databricks
    • Google Cloud Dataflow
    • AWS Glue

    📈 Difficulty Level:

    Intermediate - Suitable for learners with basic Python/SQL knowledge ready to tackle real-world data challenges.

    Generated with realistic French business context and intentional quality issues for educational purposes. All data is synthetic and does not represent real individuals or companies.
