MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Najir 0123
Released under MIT
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This SQL project is focused on analyzing sales data from a relational database to gain insights into customer behavior, store performance, product sales, and the effectiveness of sales representatives. By executing a series of complex SQL queries across multiple tables, the project aggregates key metrics, such as total units sold and total revenue, and links them with customer, store, product, and staff details.
Key Objectives:
* Customer Analysis: Understand customer purchasing patterns by analyzing the total number of units and revenue generated per customer.
* Product and Category Insights: Evaluate product performance and its category's impact on overall sales.
* Store Performance: Identify which stores generate the most revenue and handle the highest sales volume.
* Sales Representative Effectiveness: Assess the performance of sales representatives by linking sales data with each representative's handled orders.

Techniques Used:
* SQL Joins: The project integrates data from multiple tables, including orders, customers, order_items, products, categories, stores, and staffs, using INNER JOIN to merge information from related tables.
* Aggregation: SUM functions are used to compute total units sold and revenue generated by each order, providing valuable insights into sales performance.
* Grouping: Data is grouped by order ID, customer, product, store, and sales representative, ensuring accurate and summarized sales metrics.

Use Cases:
* Business Decision-Making: The analysis can help businesses identify high-performing products and stores, optimize inventory, and evaluate the impact of sales teams.
* Market Segmentation: Segment customers based on geographic location (city/state) and identify patterns in purchasing behavior.
* Sales Strategy Optimization: Provide recommendations to improve sales strategies by analyzing product categories and sales rep performance.
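As a sketch of the join-and-aggregate pattern described above: the table names are taken from this description, but the column names (quantity, list_price, discount, and so on) are assumptions and may differ from the actual schema.

-- Units and revenue per order, linked to customer, product, store, and rep.
-- Column names are assumed, not confirmed by this description.
SELECT
    o.order_id
  , CONCAT(c.first_name, ' ', c.last_name) AS customer_name
  , p.product_name
  , cat.category_name
  , s.store_name
  , CONCAT(st.first_name, ' ', st.last_name) AS sales_rep
  , SUM(oi.quantity) AS total_units
  , SUM(oi.quantity * oi.list_price * (1 - oi.discount)) AS total_revenue
FROM
    orders AS o
INNER JOIN
    customers AS c ON o.customer_id = c.customer_id
INNER JOIN
    order_items AS oi ON o.order_id = oi.order_id
INNER JOIN
    products AS p ON oi.product_id = p.product_id
INNER JOIN
    categories AS cat ON p.category_id = cat.category_id
INNER JOIN
    stores AS s ON o.store_id = s.store_id
INNER JOIN
    staffs AS st ON o.staff_id = st.staff_id
GROUP BY
    o.order_id, c.first_name, c.last_name, p.product_name,
    cat.category_name, s.store_name, st.first_name, st.last_name;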
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project developed a comprehensive data management system designed to support collaborative groundwater research across institutions by establishing a centralized, structured database for hydrologic time series data. Built on the Observations Data Model (ODM), the system stores time series data and metadata in a relational SQLite database. Key project components included database construction, automation of data formatting and importation, development of analytical and visualization tools, and integration with ArcGIS for geospatial representation. The data import workflow standardizes and validates diverse .csv datasets by aligning them with ODM formatting. A Python-based module was created to facilitate data retrieval, analysis, visualization, and export, while an interactive map feature enables users to explore site-specific data availability. Additionally, a custom ArcGIS script was implemented to generate maps that incorporate stream networks, site locations, and watershed boundaries using DEMs from USGS sources. The system was tested using real-world datasets from groundwater wells and surface water gages across Utah, demonstrating its flexibility in handling diverse formats and parameters. The relational structure enabled efficient querying and visualization, and the developed tools promoted accessibility and alignment with FAIR principles.
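For illustration, here is a minimal query sketch against an ODM-style SQLite schema. The table and column names follow the published ODM design (Sites, Variables, DataValues); the project's actual implementation may differ, and the variable code is hypothetical.

-- Daily mean values for one variable, grouped by site, using ODM-style tables:
-- Sites(SiteID, SiteCode, ...), Variables(VariableID, VariableCode, ...),
-- DataValues(DataValue, LocalDateTime, SiteID, VariableID, ...).
SELECT
    s.SiteCode,
    DATE(dv.LocalDateTime) AS obs_date,
    AVG(dv.DataValue) AS daily_mean
FROM DataValues AS dv
INNER JOIN Sites AS s ON dv.SiteID = s.SiteID
INNER JOIN Variables AS v ON dv.VariableID = v.VariableID
WHERE v.VariableCode = 'GWL'   -- hypothetical groundwater-level variable code
GROUP BY s.SiteCode, DATE(dv.LocalDateTime)
ORDER BY obs_date;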
This is an in-depth analysis I created using data pulled from an open-source (ODbL) data project provided on Kaggle:
Pavansubhash. (2017). IBM HR Analytics Employee Attrition & Performance, Version 1. Retrieved August 3rd, 2023 from https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset.
Problem: The VP of People Operations/HR at [Company] wants to better understand what efforts they can make to retain more employees every year.
Question: How do education, job involvement, and work-life balance affect employee attrition?
Metrics
A survey was sent out to 2,068 current and past employees, asking a series of clear and consistent questions about different workplace variables. The surveys were anonymous to ensure that employees answered truthfully, protecting the integrity of the data collected.
Education: 1) Below College, 2) Some College, 3) Bachelor, 4) Master, 5) Doctor
Job Involvement: 1) Low, 2) Medium, 3) High, 4) Very High
Work-Life Balance: 1) Bad, 2) Good, 3) Better, 4) Best
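Once the CSV is loaded into a database, the question can also be framed in SQL. A minimal sketch, assuming a hypothetical table named hr_attrition holding the Kaggle dataset's Education and Attrition columns:

-- Attrition rate by education level (1 = Below College ... 5 = Doctor).
-- The table name hr_attrition is an assumption; Education and Attrition
-- are columns in the Kaggle CSV (Attrition is 'Yes'/'No').
SELECT
    Education,
    COUNT(*) AS employees,
    AVG(CASE WHEN Attrition = 'Yes' THEN 1.0 ELSE 0.0 END) AS attrition_rate
FROM hr_attrition
GROUP BY Education
ORDER BY Education;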
myPhyloDB is an open-source software package aimed at providing a user-friendly web interface for accessing and analyzing all of your laboratory's microbial ecology data (currently supported project types: soil, air, water, microbial, and human-associated). The storage and handling capabilities of myPhyloDB archive users' raw sequencing files and allow for easy selection of any combination of projects/samples from all of your projects using the built-in SQL database. The data processing capabilities of myPhyloDB are also flexible enough to allow the upload, storage, and analysis of pre-processed data or raw (454 or Illumina) data files using the built-in versions of Mothur and R. myPhyloDB is designed to run as a local web server, which allows a single installation to be accessible to all of your laboratory members, regardless of their operating system or other hardware limitations. myPhyloDB includes an embedded copy of the popular Mothur program and uses a customizable batch file to perform sequence editing and processing. This allows myPhyloDB to leverage the flexibility of Mothur while allowing for greater standardization of data processing and handling across all of your sequencing projects. myPhyloDB also includes an embedded copy of the R software environment for a variety of statistical analyses and graphics. Currently, myPhyloDB includes analyses for factor- or regression-based ANCOVA, principal coordinates analysis (PCoA), differential abundance analysis (DESeq), and sparse partial least-squares regression (sPLS).

Resources in this dataset: Resource Title: Website Pointer to myPhyloDB. File Name: Web Page, url: https://myphylodb.azurecloudgov.us/myPhyloDB/home/ Provides information and links to download the latest version, release history, documentation, and tutorials, including the type of analysis you would like to perform (Univariate: ANCOVA/GLM; Multivariate: DiffAbund, PCoA, or sPLS).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is derived from the AdventureWorks 2014 sample database published by Microsoft and is designed to simplify and enhance data analysis workflows. The dataset consists of multiple CSV files that have been pre-joined and transformed from the original SQL database, facilitating a smoother analytical experience in Python.
The dataset includes: * SalesOrderHeader: Integrates the sales header and sales item tables, providing a unified view of sales transactions. * CustomerMaster: Combines customer names, countries, addresses, and other related information into a single, comprehensive file. * VendorMaster: Combines vendor names, countries, addresses, and other related information into a single, comprehensive file.
These pre-joined CSVs aim to streamline data analysis, making it more accessible for users working in Python. The dataset can be used to showcase various Python projects or as a foundation for your own analyses.
Feel free to leverage this dataset for your data analysis projects, explore trends, and create visualizations. Whether you're showcasing your own Python projects or conducting independent analyses, this dataset is designed to support a wide range of data science tasks.
For those interested in recreating the CSV files from the SQL database, detailed documentation is included at the bottom of this section. It provides step-by-step instructions on how to replicate the CSVs from the AdventureWorks 2014 database using SQL queries.
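-- Query 1: extract order header fields from Sales.SalesOrderHeader,
-- with the datetime columns cast down to dates.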
SELECT
SalesOrderID
, CAST (OrderDate AS date) AS OrderDate
, CAST (ShipDate AS date) AS ShipDate
, CustomerID
, ShipToAddressID
, BillToAddressID
, SubTotal
, TaxAmt
, Freight
, TotalDue
FROM
Sales.SalesOrderHeader;
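-- Query 2: build customer details by joining person, address,
-- contact, and sales territory tables on their shared keys.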
SELECT
pa.AddressID
, pbea.BusinessEntityID
, pa.AddressLine1
, pa.City
, pa.PostalCode
, psp.[Name] AS ProvinceStateName
, pat.[Name] AS AddressType
, pea.EmailAddress
, ppp.PhoneNumber
, pp.FirstName
, pp.LastName
, sst.CountryRegionCode
, pcr.[Name] AS CountryName
, sst.[Group] AS CountryGroup
FROM
Person.[Address] AS pa
INNER JOIN
Person.BusinessEntityAddress AS pbea ON pa.AddressID = pbea.AddressID
INNER JOIN
Person.StateProvince AS psp ON pa.StateProvinceID = psp.StateProvinceID
INNER JOIN
Person.AddressType AS pat ON pbea.AddressTypeID = pat.AddressTypeID
INNER JOIN
Person.EmailAddress AS pea ON pbea.BusinessEntityID = pea.BusinessEntityID
INNER JOIN
Person.Person AS pp ON pbea.BusinessEntityID = pp.BusinessEntityID
INNER JOIN
Person.PersonPhone AS ppp ON pbea.BusinessEntityID = ppp.BusinessEntityID
INNER JOIN
Sales.SalesTerritory AS sst ON psp.TerritoryID = sst.TerritoryID
INNER JOIN
Person.CountryRegion AS pcr ON sst.CountryRegionCode = pcr.CountryRegionCode;
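The query used to build VendorMaster is not reproduced above. As a minimal sketch of the same pattern, assuming the standard AdventureWorks 2014 Purchasing.Vendor table (verify the column names and joins against your copy of the database):

-- VendorMaster sketch: vendor names with address, state, and country details,
-- joined through the same BusinessEntityID pattern as the query above.
SELECT
    pv.BusinessEntityID
  , pv.[Name] AS VendorName
  , pa.AddressLine1
  , pa.City
  , pa.PostalCode
  , psp.[Name] AS ProvinceStateName
  , pcr.[Name] AS CountryName
FROM
    Purchasing.Vendor AS pv
INNER JOIN
    Person.BusinessEntityAddress AS pbea ON pv.BusinessEntityID = pbea.BusinessEntityID
INNER JOIN
    Person.[Address] AS pa ON pbea.AddressID = pa.AddressID
INNER JOIN
    Person.StateProvince AS psp ON pa.StateProvinceID = psp.StateProvinceID
INNER JOIN
    Person.CountryRegion AS pcr ON psp.CountryRegionCode = pcr.CountryRegionCode;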
This case study has been prepared in partial fulfillment of the Capstone project, the final course in the Google Data Analytics certificate offered by Google on the Coursera platform.
I created a dataset that contains source files I wrote to perform this analysis:
* 2022-08-04-bike-share-pres.pdf - the final presentation of the results, including diagrams, conclusions, and recommendations
* 2022-08-04-bike-share-report.pdf - a document describing all stages of the project
* scripts - R, bash, and SQL scripts I created and used for this project
* spreadsheets - spreadsheets I created and used for this project

The original data regarding the bike sharing program is publicly available. The link is provided in the presentation and in the report.
Attribution-ShareAlike 4.0 International (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.
Dataset Overview
This dataset includes five CSV files:
patients.csv – Patient demographics, contact details, registration info, and insurance data
doctors.csv – Doctor profiles with specializations, experience, and contact information
appointments.csv – Appointment dates, times, visit reasons, and statuses
treatments.csv – Treatment types, descriptions, dates, and associated costs
billing.csv – Billing amounts, payment methods, and status linked to treatments
📁 Files & Column Descriptions
**patients.csv**
Contains patient demographic and registration details.
Column -> Description
patient_id -> Unique ID for each patient
first_name -> Patient's first name
last_name -> Patient's last name
gender -> Gender (M/F)
date_of_birth -> Date of birth
contact_number -> Phone number
address -> Address of the patient
registration_date -> Date of first registration at the hospital
insurance_provider -> Insurance company name
insurance_number -> Policy number
email -> Email address
**doctors.csv**
Details about the doctors working in the hospital.
Column -> Description
doctor_id -> Unique ID for each doctor
first_name -> Doctor's first name
last_name -> Doctor's last name
specialization -> Medical field of expertise
phone_number -> Contact number
years_experience -> Total years of experience
hospital_branch -> Branch of hospital where the doctor is based
email -> Official email address
**appointments.csv**
Records of scheduled and completed patient appointments.
Column -> Description
appointment_id -> Unique appointment ID
patient_id -> ID of the patient
doctor_id -> ID of the attending doctor
appointment_date -> Date of the appointment
appointment_time -> Time of the appointment
reason_for_visit -> Purpose of visit (e.g., checkup)
status -> Status (Scheduled, Completed, Cancelled)
**treatments.csv**
Information about the treatments given during appointments.
Column -> Description
treatment_id -> Unique ID for each treatment
appointment_id -> Associated appointment ID
treatment_type -> Type of treatment (e.g., MRI, X-ray)
description -> Notes or procedure details
cost -> Cost of treatment
treatment_date -> Date when treatment was given
**billing.csv**
Billing and payment details for treatments.
Column -> Description
bill_id -> Unique billing ID
patient_id -> ID of the billed patient
treatment_id -> ID of the related treatment
bill_date -> Date of billing
amount -> Total amount billed
payment_method -> Mode of payment (Cash, Card, Insurance)
payment_status -> Status of payment (Paid, Pending, Failed)
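To illustrate how the five files link together, here is a minimal query sketch (total billed amount per doctor); it assumes the CSVs have been loaded into tables named after the files:

-- Total paid billing per doctor, linking billing -> treatments -> appointments -> doctors.
SELECT
    d.doctor_id,
    d.first_name,
    d.last_name,
    d.specialization,
    SUM(b.amount) AS total_billed
FROM billing AS b
INNER JOIN treatments AS t ON b.treatment_id = t.treatment_id
INNER JOIN appointments AS a ON t.appointment_id = a.appointment_id
INNER JOIN doctors AS d ON a.doctor_id = d.doctor_id
WHERE b.payment_status = 'Paid'
GROUP BY d.doctor_id, d.first_name, d.last_name, d.specialization
ORDER BY total_billed DESC;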
Possible Use Cases
SQL queries and relational database design
Exploratory data analysis (EDA) and dashboarding
Machine learning projects (e.g., cost prediction, no-show analysis)
Feature engineering and data cleaning practice
End-to-end healthcare analytics workflows
Recommended Tools & Resources
SQL (joins, filters, window functions)
Pandas and Matplotlib/Seaborn for EDA
Scikit-learn for ML models
Pandas Profiling for automated EDA
Plotly for interactive visualizations
Please note:
All data is synthetically generated for educational and project use. No real patient information is included.
If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Docker containers are a standardized way of packaging applications and their execution environment in a reproducible manner. This dataset is an extension of an existing docker dataset with over 100,000 Dockerfiles in 15,000 projects (https://zenodo.org/record/1200869/).
This dataset was used to extract patching patterns for Dockerfiles with the goal of improving quality in an automatic fashion.
The extension of the original dataset includes:
Static analysis results of every version of every Dockerfile
Vulnerability data from analyzing a limited number of built Docker images
A second database containing quality patches based on the static analysis results
Files:
msr18_extended - a compressed, binary PostgreSQL database dump of the docker dataset extended with analysis results
patch_database - a compressed, binary PostgreSQL database dump of extracted patches
patch_datbase.sql - a PostgreSQL plain SQL database dump of extracted patches
Further information on the artifact used to extract and apply patches, and instructions for importing the database dumps, are provided here: https://github.com/mandoway/dfp
**One analysis done in spreadsheets with 202004 and 202005 data**
To adjust for outlier ride lengths, I checked the max and min:

Max RL: =MAX(N:N) → 978:40:02
Min RL: =MIN(N:N) → -0:02:56

TRIMMEAN shaves off the top and bottom of a dataset:

TRIMMEAN (5%): =TRIMMEAN(N:N,5%) → 0:20:20
TRIMMEAN (2%): =TRIMMEAN(N:N,2%) → 0:21:27

Otherwise, the average ride length for 202004 is 0:35:51.

The most common day of the week is Sunday (mode of DOW = 1). COUNTIF on the member_casual column gives 61,148 member rides and 23,628 casual rides.

Pivot table 1 (2020-04): AVERAGE of ride_length by member_casual.

The same calculations for 2020-05:

Average RL: 0:33:23
Max RL: 481:36:53
Min RL: -0:01:48
Mode of DOW: 7 (Saturday)
COUNTIF member: 113,365
COUNTIF casual: 86,909
TRIMMEAN (5% / 2%): 0:25:22 / 0:26:59
There are 4 pivot tables included in separate sheets for other comparisons.
I gathered this data using the sources provided by the Google Data Analytics course. All of the work shown is my own.
I want to analyze the data further in SQL and Tableau.
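As a starting point for that SQL work, here is a sketch of the average-ride-length calculation in SQLite syntax. The table name trips_202004 is hypothetical, and the started_at, ended_at, and member_casual column names are assumed from the public trip-data extracts:

-- Average ride length (minutes) per rider type, SQLite syntax.
-- Table and column names are assumptions, not taken from the project files.
SELECT
    member_casual,
    COUNT(*) AS rides,
    AVG((strftime('%s', ended_at) - strftime('%s', started_at)) / 60.0) AS avg_ride_minutes
FROM trips_202004
WHERE ended_at > started_at   -- drop the negative ride lengths noted above
GROUP BY member_casual;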
The Sakila sample database is a fictitious database designed to represent a DVD rental store. The tables of the database include film, film_category, actor, customer, rental, payment and inventory among others. The Sakila sample database is intended to provide a standard schema that can be used for examples in books, tutorials, articles, samples, and so forth. Detailed information about the database can be found on the MySQL website: https://dev.mysql.com/doc/sakila/en/
Sakila for SQLite is part of the sakila-sample-database-ports project, which is intended to provide ported versions of the original MySQL database for other database systems.
Sakila for SQLite is a port of the Sakila example database available for MySQL, originally developed by Mike Hillyer of the MySQL AB documentation team. The project is designed to help database administrators decide which database to use for the development of new products: the user can run the same SQL against different kinds of databases and compare the performance.
License: BSD. Copyright DB Software Laboratory, http://www.etl-tools.com
Note: Part of the insert scripts were generated by Advanced ETL Processor http://www.etl-tools.com/etl-tools/advanced-etl-processor-enterprise/overview.html
Information about the project and the downloadable files can be found at: https://code.google.com/archive/p/sakila-sample-database-ports/
Other versions and developments of the project can be found at: https://github.com/ivanceras/sakila/tree/master/sqlite-sakila-db
https://github.com/jOOQ/jOOQ/tree/main/jOOQ-examples/Sakila
Direct access to the MySQL Sakila database, which does not require installation of MySQL (queries can be typed directly in the browser), is provided on the phpMyAdmin demo version website: https://demo.phpmyadmin.net/master-config/
The files in the sqlite-sakila-db folder are the script files which can be used to generate the SQLite version of the database. For convenience, the script files have already been run in cmd to generate the sqlite-sakila.db file, as follows:
sqlite> .open sqlite-sakila.db
# creates the .db file
sqlite> .read sqlite-sakila-schema.sql
# creates the database schema
sqlite> .read sqlite-sakila-insert-data.sql
# inserts the data
Therefore, the sqlite-sakila.db file can be loaded directly into SQLite3 and queries can be executed immediately. You can refer to my notebook for an overview of the database and a demonstration of SQL queries. Note: data for the film_text table is not provided in the script files, so the film_text table is empty; instead, the film_id, title, and description fields are included in the film table. Moreover, the Sakila sample database has many versions, so an Entity Relationship Diagram (ERD) is provided to describe this specific version. You are advised to refer to the ERD to familiarise yourself with the structure of the database.
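For example, once sqlite-sakila.db is opened, a query like the following uses the standard Sakila rental, inventory, and film tables to list the most-rented films:

-- Top 10 most-rented films, joining rental -> inventory -> film.
SELECT
    f.title,
    COUNT(r.rental_id) AS times_rented
FROM rental AS r
INNER JOIN inventory AS i ON r.inventory_id = i.inventory_id
INNER JOIN film AS f ON i.film_id = f.film_id
GROUP BY f.film_id, f.title
ORDER BY times_rented DESC
LIMIT 10;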