MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Najir 0123
Released under MIT
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This SQL project is focused on analyzing sales data from a relational database to gain insights into customer behavior, store performance, product sales, and the effectiveness of sales representatives. By executing a series of complex SQL queries across multiple tables, the project aggregates key metrics, such as total units sold and total revenue, and links them with customer, store, product, and staff details.
Key Objectives:
* Customer Analysis: Understand customer purchasing patterns by analyzing the total number of units and revenue generated per customer.
* Product and Category Insights: Evaluate product performance and its category's impact on overall sales.
* Store Performance: Identify which stores generate the most revenue and handle the highest sales volume.
* Sales Representative Effectiveness: Assess the performance of sales representatives by linking sales data with each representative's handled orders.

Techniques Used:
* SQL Joins: The project integrates data from multiple tables, including orders, customers, order_items, products, categories, stores, and staffs, using INNER JOIN to merge information from related tables.
* Aggregation: SUM functions are used to compute total units sold and revenue generated by each order, providing valuable insights into sales performance.
* Grouping: Data is grouped by order ID, customer, product, store, and sales representative, ensuring accurate and summarized sales metrics.

Use Cases:
* Business Decision-Making: The analysis can help businesses identify high-performing products and stores, optimize inventory, and evaluate the impact of sales teams.
* Market Segmentation: Segment customers based on geographic location (city/state) and identify patterns in purchasing behavior.
* Sales Strategy Optimization: Provide recommendations to improve sales strategies by analyzing product categories and sales rep performance.
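As a sketch of the join-and-aggregate pattern described above: the table names are taken from this description, but the column names (quantity, list_price, discount, and so on) are assumptions and may differ from the actual schema.

-- Units and revenue per order, linked to customer, product, store, and rep.
-- Column names are assumed, not confirmed by this description.
SELECT
    o.order_id
  , CONCAT(c.first_name, ' ', c.last_name) AS customer_name
  , p.product_name
  , cat.category_name
  , s.store_name
  , CONCAT(st.first_name, ' ', st.last_name) AS sales_rep
  , SUM(oi.quantity) AS total_units
  , SUM(oi.quantity * oi.list_price * (1 - oi.discount)) AS total_revenue
FROM
    orders AS o
INNER JOIN
    customers AS c ON o.customer_id = c.customer_id
INNER JOIN
    order_items AS oi ON o.order_id = oi.order_id
INNER JOIN
    products AS p ON oi.product_id = p.product_id
INNER JOIN
    categories AS cat ON p.category_id = cat.category_id
INNER JOIN
    stores AS s ON o.store_id = s.store_id
INNER JOIN
    staffs AS st ON o.staff_id = st.staff_id
GROUP BY
    o.order_id, c.first_name, c.last_name, p.product_name,
    cat.category_name, s.store_name, st.first_name, st.last_name;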
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project developed a comprehensive data management system designed to support collaborative groundwater research across institutions by establishing a centralized, structured database for hydrologic time series data. Built on the Observations Data Model (ODM), the system stores time series data and metadata in a relational SQLite database. Key project components included database construction, automation of data formatting and importation, development of analytical and visualization tools, and integration with ArcGIS for geospatial representation. The data import workflow standardizes and validates diverse .csv datasets by aligning them with ODM formatting. A Python-based module was created to facilitate data retrieval, analysis, visualization, and export, while an interactive map feature enables users to explore site-specific data availability. Additionally, a custom ArcGIS script was implemented to generate maps that incorporate stream networks, site locations, and watershed boundaries using DEMs from USGS sources. The system was tested using real-world datasets from groundwater wells and surface water gages across Utah, demonstrating its flexibility in handling diverse formats and parameters. The relational structure enabled efficient querying and visualization, and the developed tools promoted accessibility and alignment with FAIR principles.
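For illustration, here is a minimal query sketch against an ODM-style SQLite schema. The table and column names follow the published ODM design (Sites, Variables, DataValues); the project's actual implementation may differ, and the variable code is hypothetical.

-- Daily mean values for one variable, grouped by site, using ODM-style tables:
-- Sites(SiteID, SiteCode, ...), Variables(VariableID, VariableCode, ...),
-- DataValues(DataValue, LocalDateTime, SiteID, VariableID, ...).
SELECT
    s.SiteCode,
    DATE(dv.LocalDateTime) AS obs_date,
    AVG(dv.DataValue) AS daily_mean
FROM DataValues AS dv
INNER JOIN Sites AS s ON dv.SiteID = s.SiteID
INNER JOIN Variables AS v ON dv.VariableID = v.VariableID
WHERE v.VariableCode = 'GWL'   -- hypothetical groundwater-level variable code
GROUP BY s.SiteCode, DATE(dv.LocalDateTime)
ORDER BY obs_date;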
This is an in-depth analysis I created using data pulled from an open-source (ODbL) data project provided on Kaggle:
Pavansubhash. (2017). IBM HR Analytics Employee Attrition & Performance, Version 1. Retrieved August 3rd, 2023 from https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset.
Problem: The VP of People Operations/HR at [Company] wants to better understand what efforts they can make to retain more employees every year.
Question: How do education, job involvement, and work-life balance affect employee attrition?
Metrics
A survey was sent out to 2,068 current and past employees, asking a series of clear and consistent questions about different workplace variables. The surveys were anonymous to ensure that employees answered truthfully, protecting the integrity of the data collected.
Education: 1) Below College, 2) Some College, 3) Bachelor, 4) Master, 5) Doctor
Job Involvement: 1) Low, 2) Medium, 3) High, 4) Very High
Work-Life Balance: 1) Bad, 2) Good, 3) Better, 4) Best
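Once the CSV is loaded into a database, the question can also be framed in SQL. A minimal sketch, assuming a hypothetical table named hr_attrition holding the Kaggle dataset's Education and Attrition columns:

-- Attrition rate by education level (1 = Below College ... 5 = Doctor).
-- The table name hr_attrition is an assumption; Education and Attrition
-- are columns in the Kaggle CSV (Attrition is 'Yes'/'No').
SELECT
    Education,
    COUNT(*) AS employees,
    AVG(CASE WHEN Attrition = 'Yes' THEN 1.0 ELSE 0.0 END) AS attrition_rate
FROM hr_attrition
GROUP BY Education
ORDER BY Education;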
myPhyloDB is an open-source software package aimed at providing a user-friendly web interface for accessing and analyzing all of your laboratory's microbial ecology data (currently supported project types: soil, air, water, microbial, and human-associated). The storage and handling capabilities of myPhyloDB archive users' raw sequencing files and allow for easy selection of any combination of projects/samples from all of your projects using the built-in SQL database. The data processing capabilities of myPhyloDB are also flexible enough to allow the upload, storage, and analysis of pre-processed data or raw (454 or Illumina) data files using the built-in versions of Mothur and R. myPhyloDB is designed to run as a local web server, which allows a single installation to be accessible to all of your laboratory members, regardless of their operating system or other hardware limitations. myPhyloDB includes an embedded copy of the popular Mothur program and uses a customizable batch file to perform sequence editing and processing. This allows myPhyloDB to leverage the flexibility of Mothur while allowing for greater standardization of data processing and handling across all of your sequencing projects. myPhyloDB also includes an embedded copy of the R software environment for a variety of statistical analyses and graphics. Currently, myPhyloDB includes analyses for factor- or regression-based ANCOVA, principal coordinates analysis (PCoA), differential abundance analysis (DESeq), and sparse partial least-squares regression (sPLS).

Resources in this dataset: Resource Title: Website Pointer to myPhyloDB. File Name: Web Page, url: https://myphylodb.azurecloudgov.us/myPhyloDB/home/ Provides information and links to download the latest version, release history, documentation, and tutorials, including the type of analysis you would like to perform (Univariate: ANCOVA/GLM; Multivariate: DiffAbund, PCoA, or sPLS).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is derived from the AdventureWorks 2014 sample database published by Microsoft and is designed to simplify and enhance data analysis workflows. The dataset consists of multiple CSV files that have been pre-joined and transformed from the original SQL database, facilitating a smoother analytical experience in Python.
The dataset includes: * SalesOrderHeader: Integrates the sales header and sales item tables, providing a unified view of sales transactions. * CustomerMaster: Combines customer names, countries, addresses, and other related information into a single, comprehensive file. * VendorMaster: Combines vendor names, countries, addresses, and other related information into a single, comprehensive file.
These pre-joined CSVs aim to streamline data analysis, making it more accessible for users working in Python. The dataset can be used to showcase various Python projects or as a foundation for your own analyses.
Feel free to leverage this dataset for your data analysis projects, explore trends, and create visualizations. Whether you're showcasing your own Python projects or conducting independent analyses, this dataset is designed to support a wide range of data science tasks.
For those interested in recreating the CSV files from the SQL database, detailed documentation is included at the bottom of this section. It provides step-by-step instructions on how to replicate the CSVs from the AdventureWorks 2014 database using SQL queries.
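-- Query 1: extract order header fields from Sales.SalesOrderHeader,
-- with the datetime columns cast down to dates.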
SELECT
SalesOrderID
, CAST (OrderDate AS date) AS OrderDate
, CAST (ShipDate AS date) AS ShipDate
, CustomerID
, ShipToAddressID
, BillToAddressID
, SubTotal
, TaxAmt
, Freight
, TotalDue
FROM
Sales.SalesOrderHeader;
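-- Query 2: build customer details by joining person, address,
-- contact, and sales territory tables on their shared keys.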
SELECT
pa.AddressID
, pbea.BusinessEntityID
, pa.AddressLine1
, pa.City
, pa.PostalCode
, psp.[Name] AS ProvinceStateName
, pat.[Name] AS AddressType
, pea.EmailAddress
, ppp.PhoneNumber
, pp.FirstName
, pp.LastName
, sst.CountryRegionCode
, pcr.[Name] AS CountryName
, sst.[Group] AS CountryGroup
FROM
Person.[Address] AS pa
INNER JOIN
Person.BusinessEntityAddress AS pbea ON pa.AddressID = pbea.AddressID
INNER JOIN
Person.StateProvince AS psp ON pa.StateProvinceID = psp.StateProvinceID
INNER JOIN
Person.AddressType AS pat ON pbea.AddressTypeID = pat.AddressTypeID
INNER JOIN
Person.EmailAddress AS pea ON pbea.BusinessEntityID = pea.BusinessEntityID
INNER JOIN
Person.Person AS pp ON pbea.BusinessEntityID = pp.BusinessEntityID
INNER JOIN
Person.PersonPhone AS ppp ON pbea.BusinessEntityID = ppp.BusinessEntityID
INNER JOIN
Sales.SalesTerritory AS sst ON psp.TerritoryID = sst.TerritoryID
INNER JOIN
Person.CountryRegion AS pcr ON sst.CountryRegionCode = pcr.CountryRegionCode;
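The query used to build VendorMaster is not reproduced above. As a minimal sketch of the same pattern, assuming the standard AdventureWorks 2014 Purchasing.Vendor table (verify the column names and joins against your copy of the database):

-- VendorMaster sketch: vendor names with address, state, and country details,
-- joined through the same BusinessEntityID pattern as the query above.
SELECT
    pv.BusinessEntityID
  , pv.[Name] AS VendorName
  , pa.AddressLine1
  , pa.City
  , pa.PostalCode
  , psp.[Name] AS ProvinceStateName
  , pcr.[Name] AS CountryName
FROM
    Purchasing.Vendor AS pv
INNER JOIN
    Person.BusinessEntityAddress AS pbea ON pv.BusinessEntityID = pbea.BusinessEntityID
INNER JOIN
    Person.[Address] AS pa ON pbea.AddressID = pa.AddressID
INNER JOIN
    Person.StateProvince AS psp ON pa.StateProvinceID = psp.StateProvinceID
INNER JOIN
    Person.CountryRegion AS pcr ON psp.CountryRegionCode = pcr.CountryRegionCode;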
This case study has been prepared in partial fulfillment of the Capstone project, the final course in the Google Data Analytics certificate offered by Google on the Coursera platform.
I created a dataset that contains source files I wrote to perform this analysis:
* 2022-08-04-bike-share-pres.pdf - the final presentation of the results, including diagrams, conclusions, and recommendations
* 2022-08-04-bike-share-report.pdf - a document describing all stages of the project
* scripts - R, bash, and SQL scripts I created and used for this project
* spreadsheets - spreadsheets I created and used for this project

The original data regarding the bike sharing program is publicly available. The link is provided in the presentation and in the report.
Attribution-ShareAlike 4.0 International (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.
Dataset Overview
This dataset includes five CSV files:
patients.csv – Patient demographics, contact details, registration info, and insurance data
doctors.csv – Doctor profiles with specializations, experience, and contact information
appointments.csv – Appointment dates, times, visit reasons, and statuses
treatments.csv – Treatment types, descriptions, dates, and associated costs
billing.csv – Billing amounts, payment methods, and status linked to treatments
📁 Files & Column Descriptions
**patients.csv**
Contains patient demographic and registration details.
Column -> Description
patient_id -> Unique ID for each patient
first_name -> Patient's first name
last_name -> Patient's last name
gender -> Gender (M/F)
date_of_birth -> Date of birth
contact_number -> Phone number
address -> Address of the patient
registration_date -> Date of first registration at the hospital
insurance_provider -> Insurance company name
insurance_number -> Policy number
email -> Email address
**doctors.csv**
Details about the doctors working in the hospital.
Column -> Description
doctor_id -> Unique ID for each doctor
first_name -> Doctor's first name
last_name -> Doctor's last name
specialization -> Medical field of expertise
phone_number -> Contact number
years_experience -> Total years of experience
hospital_branch -> Branch of hospital where the doctor is based
email -> Official email address
**appointments.csv**
Records of scheduled and completed patient appointments.
Column -> Description
appointment_id -> Unique appointment ID
patient_id -> ID of the patient
doctor_id -> ID of the attending doctor
appointment_date -> Date of the appointment
appointment_time -> Time of the appointment
reason_for_visit -> Purpose of visit (e.g., checkup)
status -> Status (Scheduled, Completed, Cancelled)
**treatments.csv**
Information about the treatments given during appointments.
Column -> Description
treatment_id -> Unique ID for each treatment
appointment_id -> Associated appointment ID
treatment_type -> Type of treatment (e.g., MRI, X-ray)
description -> Notes or procedure details
cost -> Cost of treatment
treatment_date -> Date when treatment was given
**billing.csv**
Billing and payment details for treatments.
Column -> Description
bill_id -> Unique billing ID
patient_id -> ID of the billed patient
treatment_id -> ID of the related treatment
bill_date -> Date of billing
amount -> Total amount billed
payment_method -> Mode of payment (Cash, Card, Insurance)
payment_status -> Status of payment (Paid, Pending, Failed)
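To illustrate how the five files link together, here is a minimal query sketch (total billed amount per doctor); it assumes the CSVs have been loaded into tables named after the files:

-- Total paid billing per doctor, linking billing -> treatments -> appointments -> doctors.
SELECT
    d.doctor_id,
    d.first_name,
    d.last_name,
    d.specialization,
    SUM(b.amount) AS total_billed
FROM billing AS b
INNER JOIN treatments AS t ON b.treatment_id = t.treatment_id
INNER JOIN appointments AS a ON t.appointment_id = a.appointment_id
INNER JOIN doctors AS d ON a.doctor_id = d.doctor_id
WHERE b.payment_status = 'Paid'
GROUP BY d.doctor_id, d.first_name, d.last_name, d.specialization
ORDER BY total_billed DESC;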
Possible Use Cases
SQL queries and relational database design
Exploratory data analysis (EDA) and dashboarding
Machine learning projects (e.g., cost prediction, no-show analysis)
Feature engineering and data cleaning practice
End-to-end healthcare analytics workflows
Recommended Tools & Resources
SQL (joins, filters, window functions)
Pandas and Matplotlib/Seaborn for EDA
Scikit-learn for ML models
Pandas Profiling for automated EDA
Plotly for interactive visualizations
Please note:
All data is synthetically generated for educational and project use. No real patient information is included.
If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Docker containers are a standardized way of packaging applications and their execution environment in a reproducible manner. This dataset is an extension of an existing docker dataset with over 100,000 Dockerfiles in 15,000 projects (https://zenodo.org/record/1200869/).
This dataset was used to extract patching patterns for Dockerfiles with the goal of improving quality in an automatic fashion.
The extension of the original dataset includes:
Static analysis results of every version of every Dockerfile
Vulnerability data from analyzing a limited number of built Docker images
A second database containing quality patches based on the static analysis results
Files:
msr18_extended - a compressed, binary PostgreSQL database dump of the docker dataset extended with analysis results
patch_database - a compressed, binary PostgreSQL database dump of extracted patches
patch_datbase.sql - a PostgreSQL plain SQL database dump of extracted patches
Further information on the artifact used to extract and apply patches, and instructions for importing the database dumps, are provided here: https://github.com/mandoway/dfp
**One analysis done in spreadsheets with 202004 and 202005 data**
To adjust for outlier ride lengths, I checked the max and min:

Max RL: =MAX(N:N) → 978:40:02
Min RL: =MIN(N:N) → -0:02:56

TRIMMEAN shaves off the top and bottom of a dataset:

TRIMMEAN (5%): =TRIMMEAN(N:N,5%) → 0:20:20
TRIMMEAN (2%): =TRIMMEAN(N:N,2%) → 0:21:27

Otherwise, the average ride length for 202004 is 0:35:51.

The most common day of the week is Sunday (mode of DOW = 1). COUNTIF on the member_casual column gives 61,148 member rides and 23,628 casual rides.

Pivot table 1 (2020-04): AVERAGE of ride_length by member_casual.

The same calculations for 2020-05:

Average RL: 0:33:23
Max RL: 481:36:53
Min RL: -0:01:48
Mode of DOW: 7 (Saturday)
COUNTIF member: 113,365
COUNTIF casual: 86,909
TRIMMEAN (5% / 2%): 0:25:22 / 0:26:59
There are 4 pivot tables included in separate sheets for other comparisons.
I gathered this data using the sources provided by the Google Data Analytics course. All of the work shown is my own.
I want to analyze the data further in SQL and Tableau.
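As a starting point for that SQL work, here is a sketch of the average-ride-length calculation in SQLite syntax. The table name trips_202004 is hypothetical, and the started_at, ended_at, and member_casual column names are assumed from the public trip-data extracts:

-- Average ride length (minutes) per rider type, SQLite syntax.
-- Table and column names are assumptions, not taken from the project files.
SELECT
    member_casual,
    COUNT(*) AS rides,
    AVG((strftime('%s', ended_at) - strftime('%s', started_at)) / 60.0) AS avg_ride_minutes
FROM trips_202004
WHERE ended_at > started_at   -- drop the negative ride lengths noted above
GROUP BY member_casual;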
The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/
Sakila for SQLite is part of the sakila-sample-database-ports project, which is intended to provide ported versions of the original MySQL database for other database systems.
Sakila for SQLite is a port of the Sakila example database available for MySQL, originally developed by Mike Hillyer of the MySQL AB documentation team. The project is designed to help database administrators decide which database to use for the development of new products: the user can run the same SQL against different kinds of databases and compare the performance.
License: BSD. Copyright DB Software Laboratory, http://www.etl-tools.com
Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html
Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/
Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db
https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila
Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/
The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:
sqlite> .open sqlite-sakila.db
# creates the .db file
sqlite> .read sqlite-sakila-schema.sql
# creates the database schema
sqlite> .read sqlite-sakila-insert-data.sql
# inserts the data
Therefore, the sqlite-sakila.db file can be loaded directly into SQLite3 and queries can be executed immediately. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: data for the film_text table is not provided in the script files, so the film_text table is empty; instead, the film_id, title, and description fields are included in the film table. Moreover, the Sakila sample database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
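For example, once sqlite-sakila.db is opened, a query like the following uses the standard Sakila rental, inventory, and film tables to list the most-rented films:

-- Top 10 most-rented films, joining rental -> inventory -> film.
SELECT
    f.title,
    COUNT(r.rental_id) AS times_rented
FROM rental AS r
INNER JOIN inventory AS i ON r.inventory_id = i.inventory_id
INNER JOIN film AS f ON i.film_id = f.film_id
GROUP BY f.film_id, f.title
ORDER BY times_rented DESC
LIMIT 10;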