Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by Abdallah Nasser
Released under Apache 2.0
• Conducted an in-depth Exploratory Data Analysis (EDA) using MySQL on a comprehensive dataset of tech layoffs from March 2020 to present, sourced from Kaggle.
• Utilized advanced SQL queries to extract, clean, and analyze large datasets, uncovering significant insights into the timing, frequency, and scale of layoffs across various tech companies and regions.
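A minimal sketch of the kind of MySQL query used in this analysis, aggregating layoffs by month — the table and column names (layoffs, total_laid_off, date) are assumptions for illustration and may differ from the actual Kaggle schema:

-- Monthly layoff events and total employees laid off (illustrative schema).
SELECT DATE_FORMAT(`date`, '%Y-%m') AS layoff_month,
       COUNT(*)                     AS layoff_events,
       SUM(total_laid_off)          AS employees_laid_off
FROM layoffs
WHERE total_laid_off IS NOT NULL
GROUP BY layoff_month
ORDER BY layoff_month;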
SQL Case Study Project: Employee Database Analysis 📊
I recently completed a comprehensive SQL project involving a simulated employee database with multiple tables:
In this project, I practiced and applied a wide range of SQL concepts:
✅ Simple Queries ✅ Filtering with WHERE conditions ✅ Sorting with ORDER BY ✅ Aggregation using GROUP BY and HAVING ✅ Multi-table JOINs ✅ Conditional Logic using CASE ✅ Subqueries and Set Operators
💡 Key Highlights:
🛠️ Tools Used: Azure Data Studio
📂 You can find the entire project and scripts here:
👉 https://github.com/RiddhiNDivecha/Employee-Database-Analysis
This project helped me sharpen my SQL skills and understand business logic more deeply in a practical context.
💬 I’m open to feedback and happy to connect with fellow data enthusiasts!
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains comprehensive synthetic healthcare data designed for fraud detection analysis. It includes information on patients, healthcare providers, insurance claims, and payments. The dataset is structured to mimic real-world healthcare transactions, where fraudulent activities such as false claims, overbilling, and duplicate charges can be identified through advanced analytics.
The dataset is suitable for practicing SQL queries, exploratory data analysis (EDA), machine learning for fraud detection, and visualization techniques. It is designed to help data analysts and data scientists develop and refine their analytical skills in the healthcare insurance domain.
Dataset Overview
The dataset consists of four CSV files:

Patients Data (patients.csv)
Contains demographic details of patients, such as age, gender, insurance type, and location. Can be used to analyze patient demographics and healthcare usage patterns.

Providers Data (providers.csv)
Contains information about healthcare providers, including provider ID, specialty, location, and associated hospital. Useful for identifying fraudulent claims linked to specific providers or hospitals.

Claims Data (claims.csv)
Contains records of insurance claims made by patients, including diagnosis codes, treatment details, provider ID, and claim amount. Can be analyzed for suspicious patterns, such as excessive claims from a single provider or duplicate claims for the same patient.

Payments Data (payments.csv)
Contains details of claim payments made by insurance companies, including payment amount, claim ID, and reimbursement status. Helps in detecting discrepancies between claims and actual reimbursements.

Possible Analysis Ideas
This dataset allows for multiple analysis approaches, including but not limited to:
🔹 Fraud Detection: Identify patterns in claims data to detect fraudulent activities (e.g., excessive billing, duplicate claims).
🔹 Provider Behavior Analysis: Analyze providers who have an unusually high claim volume or high rejection rates.
🔹 Payment Trends: Compare claims vs. payments to find irregularities in reimbursement patterns.
🔹 Patient Demographics & Utilization: Explore which patient groups are more likely to file claims and receive reimbursements.
🔹 SQL Query Practice: Perform advanced SQL queries, including joins, aggregations, window functions, and subqueries, to extract insights from the data (a sample query is sketched below).
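For example, duplicate claims could be flagged with a query along these lines — a hedged sketch that assumes claims.csv is loaded into a claims table with columns such as patient_id, provider_id, diagnosis_code, and claim_amount; verify the actual column names against the files:

-- Potential duplicate claims: the same patient, provider, diagnosis, and amount filed more than once.
SELECT patient_id,
       provider_id,
       diagnosis_code,
       claim_amount,
       COUNT(*) AS times_filed
FROM claims
GROUP BY patient_id, provider_id, diagnosis_code, claim_amount
HAVING COUNT(*) > 1
ORDER BY times_filed DESC;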
Use Cases
Practicing SQL queries for job interviews and real-world projects. Learning data cleaning, data wrangling, and feature engineering for healthcare analytics. Applying machine learning techniques for fraud detection. Gaining insights into the healthcare insurance domain and its challenges.
License & Usage
License: CC0 Public Domain (free to use for any purpose).
Attribution: Not required but appreciated.
Intended Use: This dataset is for educational and research purposes only.
This dataset is an excellent resource for aspiring data analysts, data scientists, and SQL learners who want to gain hands-on experience in healthcare fraud detection.
The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/
Sakila for SQLite is part of the sakila-sample-database-ports project, which provides ported versions of the original MySQL database for other database systems.
Sakila for SQLite is a port of the Sakila example database available for MySQL, originally developed by Mike Hillyer of the MySQL AB documentation team. The project is designed to help database administrators decide which database to use for development of new products: the same SQL can be run against different kinds of databases and the performance compared.
License: BSD. Copyright DB Software Laboratory (http://www.etl-tools.com).
Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html
Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/
Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db
https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila
Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/
The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:
sqlite> .open sqlite-sakila.db # creates the .db file
sqlite> .read sqlite-sakila-schema.sql # creates the database schema
sqlite> .read sqlite-sakila-insert-data.sql # inserts the data
Therefore, the sqlite-sakila.db file can be directly loaded into SQLite3 and queries can be directly executed. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: Data about the film_text table is not provided in the script files, thus the film_text table is empty. Instead the film_id, title and description fields are included in the film table. Moreover, the Sakila Sample Database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
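As a quick demonstration of the kind of query the database supports, the following sketch lists the most-rented films; it follows the standard Sakila table and column names (film, inventory, rental), which should be double-checked against the ERD for this specific port:

-- Top 10 most-rented films, joining film -> inventory -> rental.
SELECT f.title,
       COUNT(r.rental_id) AS rental_count
FROM film AS f
JOIN inventory AS i ON i.film_id = f.film_id
JOIN rental AS r ON r.inventory_id = i.inventory_id
GROUP BY f.film_id, f.title
ORDER BY rental_count DESC
LIMIT 10;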
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
📊 Bank Transaction Analytics Dashboard – SQL + Excel
🔹 Overview
This project focuses on Bank Transaction Analysis using a combination of SQL scripts and Excel dashboards. The goal is to provide insights into customer spending patterns, payment modes, suspicious transactions, and overall financial trends.
The dataset and analysis files can help learners and professionals understand how SQL and Excel can be used together for business decision-making, customer behavior tracking, and data-driven insights.
🔹 Contents
The dataset includes the following resources:
📂 SQL Scripts:
Create & Insert tables
15 Basic Queries
15 Advanced Queries
📂 CSV File:
Bank Transaction Analytics.csv (main dataset)
📂 Excel Charts:
Pie, Bar, Column, Line, Doughnut charts
Final Interactive Dashboard
📂 Screenshots:
Query outputs, Charts, and Final Dashboard visualization
📂 PDF Reports:
Project Report
Dashboard Report
📄 README.md:
Complete documentation and step-by-step explanation
🔹 Key Insights
26–35 age group spent the most across categories.
Amazon identified as the top merchant.
NetBanking showed the highest share compared to POS/UPI.
Travel & Shopping emerged as dominant categories.
🔹 Applications
Detecting suspicious transactions.
Understanding customer behavior.
Identifying top merchants and categories.
Building business intelligence dashboards.
🔹 How to Use
Download the dataset and SQL scripts.
Run Bank_Transaction_Analytics.SQL to create and insert data.
Execute the queries (Basic + Advanced) for insights; a sample query in that style is sketched after these steps.
Open Excel files to explore interactive charts and dashboards.
Refer to Project Report PDF for documentation.
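As an illustration of the style of query in the Advanced set, here is a hedged sketch for surfacing unusually large transactions; the table and column names (transactions, customer_id, amount, txn_date) are assumptions and should be adjusted to match Bank_Transaction_Analytics.SQL:

-- Transactions more than 3x a customer's average spend (possible anomalies); schema is illustrative.
SELECT t.customer_id,
       t.txn_date,
       t.amount
FROM transactions AS t
JOIN (SELECT customer_id, AVG(amount) AS avg_amount
      FROM transactions
      GROUP BY customer_id) AS c
  ON c.customer_id = t.customer_id
WHERE t.amount > 3 * c.avg_amount
ORDER BY t.amount DESC;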
🔹 Author
👩💻 Created by: Prachi Singh
GitHub: Bank Transaction Analytics Dashboard (https://github.com/prachi-singh-ds/Bank-Transaction-Analytics-Dashboard)
⚡This project is a complete SQL + Excel integration case study and is suitable for Data Science, Business Analytics, and Data Engineering portfolios.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This database was compiled during the EagleEye project (https://cordis.europa.eu/project/id/101059253), which focused on developing a novel 3D printer for high-resolution, large-area printing using digital light projection and two-photon polymerization. It contains essential printing parameters and their relationships to other key factors, such as wavelengths, laser specifications, and photosensitive materials. The data is stored in .csv and .sql formats, making it suitable for both basic and advanced tasks.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Complete data engineering project on 4 years (2014-2017) of retail sales transactions.
DATASET CONTENTS:
- Original denormalized data (9,994 rows)
- Normalized database: 4 tables (customers, orders, products, sales)
- 9 SQL analysis files organized by phase
- Complete EDA from data cleaning to business insights
DATABASE TABLES:
- customers: 793 records
- orders: 4,931 records
- products: 1,812 records
- sales: 9,686 transactions
KEY FINDINGS:
- Low profitability: 12.44% margin (below industry standard)
- Discount problem: 50%+ transactions have 20%+ discounts
- Loss-making: 18.66% of transactions lose money
- Furniture crisis: Only 2.31% margin
- Small baskets: Only 1.96 items per order
SQL SKILLS DEMONSTRATED:
✓ Window functions (ROW_NUMBER, PARTITION BY)
✓ Database normalization (3NF)
✓ Complex JOINs (3-4 tables)
✓ Data deduplication with CTEs
✓ Business analytics queries
✓ CASE statements and aggregations
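A representative sketch of the deduplication pattern (ROW_NUMBER over a CTE); the partitioning columns are illustrative assumptions and would need to match the actual sales table:

-- Rows with row_num > 1 are the duplicates to remove; partition columns are assumptions.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id, product_id, sale_amount
               ORDER BY order_id
           ) AS row_num
    FROM sales
)
SELECT *
FROM ranked
WHERE row_num > 1;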
PERFECT FOR:
- SQL practice (beginner to advanced)
- Database normalization learning
- EDA methodology study
- Business analytics projects
- Data engineering portfolios
FILES INCLUDED:
- 5 CSV files (original + 4 normalized tables)
- 9 SQL query files (cleaning, migration, analysis)
Author: Nawaf Alzzeer
License: CC BY-SA 4.0
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The "US Congressional Tweets Dataset" is a comprehensive collection of tweets from US Congressional members spanning from 2008 to 2017. This dataset is valuable for organizations like Lobbyists4America, which aims to gain insights into legislative trends and influences for effective lobbying strategies. The dataset is structured into two primary components: users_df and tweets_df.
users_df: This DataFrame provides detailed information about the Twitter accounts of various congressional members. It includes a range of attributes such as:
- Account creation date (created_at), follower and friend counts (followers_count, friends_count).
- Profile flags such as contributors_enabled, default_profile, is_translator, etc.

tweets_df: This DataFrame contains the actual tweet data from these congressional accounts. Key columns include:
- created_at: The timestamp of the tweet.
- favorite_count and retweet_count: Indicators of the tweet's popularity.
- text: The text content of the tweet.
- user_id, lang (language), and source (device/app used for tweeting).
- possibly_sensitive, quoted_status_id, and engagement-related fields.

The dataset is utilized for various analyses, including:
Network Analysis: Exploring the connections and interactions between different congressional members on Twitter, potentially revealing influential figures or groups within Congress.
Sentiment Analysis: Using libraries like TextBlob and NLTK, this analysis assesses the sentiment (positive, negative, neutral) of tweets to understand the general tone and stance of congressional members on various issues.
Correlation Analysis: Investigating relationships between different numerical features in the dataset, such as whether higher tweet frequencies correlate with more followers.
Word Clustering/Topic Modeling: Utilizing NMF (Non-Negative Matrix Factorization) from scikit-learn to cluster words and identify major themes or topics discussed in the tweets.
Time Series Analysis: Observing trends and patterns in tweeting behavior over time, such as increased activity around elections or significant political events.
The "US Congressional Tweets Dataset" is a rich source for analyzing the digital footprint of US Congressional members. Through the application of various data science techniques, Lobbyists4America can extract meaningful insights about political sentiments, networking patterns, and topical trends among lawmakers. This information is crucial for tailoring lobbying efforts and understanding the legislative landscape.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Motivation
The motivation behind this research stems from the pressing need to improve the prediction and management of cardiovascular disease (CVD), a leading cause of mortality worldwide. Despite advancements in medical science, there remains a significant challenge in accurately predicting and detecting CVD in its early stages. This study seeks to address this challenge by leveraging machine learning (ML) and deep learning (DL) models to analyze physiological signs associated with CVD, including respiratory rate, blood pressure, body temperature, heart rate, and oxygen saturation. By comparing the performance of various ML and DL models with a previous study conducted by Ashfaq et al., we aim to identify the most effective prediction model. The potential of achieving high accuracy rates, as demonstrated by the MLP model in our research, offers promising prospects for enhancing CVD prediction and management strategies. These findings hold implications not only for medical researchers and practitioners but also for individuals, academies, analysts, and AI enthusiasts interested in advancing healthcare technology. Furthermore, the integration of these predictive models into monitoring systems using body sensors could revolutionize the way CVD patients are managed. Real-time monitoring facilitated by advanced ML and DL algorithms could enable prompt emergency intervention, potentially saving lives and improving patient outcomes. Overall, this research contributes to the growing body of knowledge in the field of cardiovascular disease prediction and underscores the transformative potential of AI-driven approaches in healthcare.
About the dataset
In this project, we successfully utilized the MIMIC-III clinical database, renowned for its vast collection of deidentified clinical data from over 50,001 critically ill patients treated at Beth Israel Deaconess Medical Center between 2001 and 2012, as underscored by Johnson, Pollard, and Mark (2016). This database encompassed a comprehensive array of demographic information, vital signs, lab test results, treatments, medications, written notes, imaging reports, and post-hospital outcomes. Leveraging the accessibility of Google's Big Query cloud and Amazon's AWS cloud, we employed 'Amazon S3' to seamlessly extract the necessary data for our cardiovascular disease forecasting analysis.
Data Processing
Data Pre-Processing: We meticulously cleaned and prepared the raw dataset, following established procedures outlined by Chaki and Ucar (2023) and Mishra et al. (2020). This involved removing duplicates, correcting anomalies, and addressing missing values to ensure dataset accuracy. Our project efficiently utilized SQL, a standard language for relational databases, along with Amazon Web Services (AWS) Athena, a SQL-based query tool, to retrieve essential data from the MIMIC-III clinical database. Leveraging SQL queries, we accessed vital information including pulse rate, blood pressure, blood oxygen saturation, respiration rate, and body temperature. AWS Athena proved instrumental in seamlessly querying data stored on AWS, enabling swift data retrieval. Through the Athena interface, we executed SQL queries to extract the desired dataset, subsequently saving it to a CSV file for further analysis. This approach significantly streamlined the process of obtaining relevant data from the MIMIC-III database, highlighting the efficiency of SQL and AWS Athena in data retrieval for our research endeavors.
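A hedged sketch of the kind of Athena query used to pull a vital sign from the CHARTEVENTS table; the label filter is illustrative, and the correct item IDs should be confirmed against D_ITEMS in the MIMIC-III documentation:

-- Heart-rate measurements per patient from CHARTEVENTS (item lookup via D_ITEMS is illustrative).
SELECT ce.subject_id,
       ce.charttime,
       ce.valuenum AS heart_rate
FROM chartevents AS ce
JOIN d_items AS di ON di.itemid = ce.itemid
WHERE di.label LIKE '%Heart Rate%'
  AND ce.valuenum IS NOT NULL
ORDER BY ce.subject_id, ce.charttime;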
Dealing with Outliers: We carefully evaluated our dataset for outliers and retained them, as they did not significantly deviate from the mean or standard deviation, thereby maintaining the integrity of our analysis.
Data Transformation: Our team successfully transformed the dataset into a usable format with properly labeled variables, as outlined by Lachlan (2017). We opted not to scale variables to ensure accurate interpretation.
Exploratory Data Analysis (EDA): Through comprehensive EDA, we identified trends and patterns in the data, facilitating hypothesis testing and informing model development.
Model Building: Utilizing methodologies outlined by Janiesch, Zschech, & Heinrich (2021), we developed machine learning (ML) and deep learning (DL) models tailored to our research goals and dataset characteristics.
Model Selection: We carefully selected appropriate algorithms, considering factors such as data nature, complexity, and available resources, as suggested by Ghosh and Dasgupta (2022).
Human Biophysical Parameters
The project is built upon the foundation of human biophysical parameters, serving as crucial indicators for monitoring and intervening in cardiovascular disease (CVD) patients, facilitating both long-term and near-term risk assessment. These parameters, including heart rate, respiration rate, blood pressure, and oxygen saturation, play a pivotal role in ass...
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset provides a comprehensive collection of YouTube video and channel metadata curated for data analysis, visualization, and storytelling projects. It contains rich information on trending videos across multiple countries, including video performance statistics, engagement metrics, and channel-level details.
The dataset is designed to help learners and researchers explore real-world YouTube dynamics, such as:
• What type of content gains the highest views and engagement?
• How do categories perform across different countries?
• What role do publishing time, video duration, or tags play in driving popularity?
• Which channels dominate in terms of subscribers, views, and content consistency?
Features
The dataset includes detailed video-level fields such as:
• Video ID, title, description, and publish time
• Trending date and country
• Tags, categories, duration, resolution, and licensed content status
• Views, likes, and comment counts

Alongside channel-level information including:
• Channel ID, title, and description
• Channel country, publish date, and custom URL (if available)
• Subscriber count, total views, video count, and hidden subscriber flag
With this structured dataset, students and professionals can perform data cleaning, transformation, SQL querying, trend analysis, and dashboarding in tools such as Excel, SQL, Power BI, Tableau, and Python. It is also suitable for advanced machine learning tasks like predicting video performance, engagement modeling, and natural language processing on video titles and descriptions.
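For instance, an engagement-rate comparison by category might look like the following sketch, assuming the video-level fields are loaded into a videos table with columns category, views, likes, and comment_count (these names are assumptions, not the exact dataset fields):

-- Average engagement rate (likes + comments per view) by category; schema is illustrative.
SELECT category,
       COUNT(*) AS videos,
       AVG((likes + comment_count) * 1.0 / NULLIF(views, 0)) AS avg_engagement_rate
FROM videos
GROUP BY category
ORDER BY avg_engagement_rate DESC;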
Use Cases
1. Descriptive Analytics: Identify top categories, channels, and countries leading the YouTube trending space.
2. Comparative Analysis: Compare engagement rates across different regions and content types.
3. Visualization Projects: Create dashboards showing performance KPIs, category trends, and time-based patterns.
4. Storytelling: Derive business insights and best practices for creators, marketers, and educators on YouTube.
Educational Value
This dataset is structured specifically for student projects and group assignments. It ensures every learner can take a role—whether as a data engineer, analyst, visualization specialist, or business storyteller—mirroring the structure of real-world consulting projects.
Credits
This dataset is published as part of the YouTube Data Analytics Project initiated by Analytics Circle, an institute dedicated to empowering learners with practical data analytics, data science, and AI skills through hands-on projects and real-world applications.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
NOTE: Please Read Text File named "ERD Relationship Text" for Detailed Information.
This dataset represents a complete healthcare management system modeled as a relational database containing over 20 interlinked tables. It captures the entire lifecycle of healthcare operations from patient registration to diagnosis, treatment, billing, inventory, and vendor management. The data structure is designed to simulate a real-world hospital information system (HIS), enabling advanced analytics, data modeling, and visualization. You can easily visualize and explore the schema using tools like dbdiagram.io by pasting the provided table definitions.
The dataset covers multiple operational areas of a hospital including patient information, clinical operations, financial transactions, human resources, and logistics.
Patient Information includes personal, contact, and emergency details, along with identification and insurance. Clinical Operations include visits, appointments, diagnoses, treatments, and medications. Financial Transactions cover bills, payments, and vendor settlements. Human Resources include staff details, departments, and medical teams. Logistics and Inventory include equipment, medicines, supplies, and vendor relationships.
This dataset can be used for data modeling and SQL practice for complex joins and normalization, healthcare analytics projects involving cost analysis, treatment efficiency, and patient demographics, visualization projects in Power BI, Tableau, or Domo for operational insights, building ETL pipelines and data warehouse models for healthcare systems, and machine learning applications such as predicting patient readmission, billing anomalies, or treatment outcomes.
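As an example of the multi-table joins the schema supports, here is a hedged sketch; the table and column names (patients, visits, bills) are assumptions, so consult the "ERD Relationship Text" file for the real definitions:

-- Total billed amount per patient across visits (illustrative three-table join).
SELECT p.patient_id,
       COUNT(DISTINCT v.visit_id) AS visits,
       SUM(b.total_amount)        AS total_billed
FROM patients AS p
JOIN visits AS v ON v.patient_id = p.patient_id
JOIN bills  AS b ON b.visit_id = v.visit_id
GROUP BY p.patient_id
ORDER BY total_billed DESC;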
To explore the data relationships visually, go to dbdiagram.io, paste the entire provided schema code, and press 2 then 1 (or 2 and Enter) to auto-align the diagram. You’ll see an interactive Entity Relationship Diagram (ERD) representing the entire healthcare ecosystem.
Total Tables: 20+
Total Columns: 200+
Primary Focus: Patient Management, Clinical Operations, Billing, and Supply Chain
Revenue management is more crucial than ever to run a successful and profitable hotel. With all the information that is now readily accessible and the different ways to track and analyze it, your business has a wealth of new opportunities. Successful hoteliers continuously learn and improve their methods to stay one step ahead of their competition. Revenue management strategies are used by only a small percentage of independent hoteliers, limiting their revenue-generating potential.
In this Hotel-revenue project, I will address a few questions a hotel management team faces.
The questions are outlined below:
1) What is the hotel revenue growth per year?
2) Is there any relation between guests and their personal cars?
3) Are there any trends or patterns observable in the data?
These questions are answered using data-driven techniques. In this project, I use Python and write SQL queries to solve them.
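For the first question, a revenue-per-year query could look like this sketch; the bookings table and its columns (arrival_year, adr, stay-night counts, is_canceled) are assumptions modeled on common hotel-booking datasets, not necessarily this one:

-- Hotel revenue per year (illustrative schema: average daily rate x nights stayed, non-cancelled bookings).
SELECT arrival_year,
       SUM(adr * (stays_in_week_nights + stays_in_weekend_nights)) AS revenue
FROM bookings
WHERE is_canceled = 0
GROUP BY arrival_year
ORDER BY arrival_year;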
This dataset can be used for learning purposes. Future work would be to apply advanced machine learning algorithms and forecasting techniques that can generate insights to help the hotel management company outline different strategies and business plans.
Open Data Commons Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
In the case study titled "Blinkit: Grocery Product Analysis," a dataset called 'Grocery Sales' contains 12 columns with information on sales of grocery items across different outlets. Using Tableau, you as a data analyst can uncover customer behavior insights, track sales trends, and gather feedback. These insights will drive operational improvements, enhance customer satisfaction, and optimize product offerings and store layout. Tableau enables data-driven decision-making for positive outcomes at Blinkit.
The table Grocery Sales is a .CSV file and has the following columns, details of which are as follows:
• Item_Identifier: A unique ID for each product in the dataset.
• Item_Weight: The weight of the product.
• Item_Fat_Content: Indicates whether the product is low fat or not.
• Item_Visibility: The percentage of the total display area in the store that is allocated to the specific product.
• Item_Type: The category or type of product.
• Item_MRP: The maximum retail price (list price) of the product.
• Outlet_Identifier: A unique ID for each store in the dataset.
• Outlet_Establishment_Year: The year in which the store was established.
• Outlet_Size: The size of the store in terms of ground area covered.
• Outlet_Location_Type: The type of city or region in which the store is located.
• Outlet_Type: Indicates whether the store is a grocery store or a supermarket.
• Item_Outlet_Sales: The sales of the product in the particular store. This is the outcome variable that we want to predict.