https://creativecommons.org/publicdomain/zero/1.0/
Download: SQL Query

This SQL project focuses on analyzing sales data from a relational database to gain insights into customer behavior, store performance, product sales, and the effectiveness of sales representatives. By executing a series of SQL queries across multiple tables, the project aggregates key metrics, such as total units sold and total revenue, and links them with customer, store, product, and staff details.
Key Objectives:
- Customer Analysis: Understand customer purchasing patterns by analyzing the total number of units and revenue generated per customer.
- Product and Category Insights: Evaluate product performance and each product category's impact on overall sales.
- Store Performance: Identify which stores generate the most revenue and handle the highest sales volume.
- Sales Representative Effectiveness: Assess the performance of sales representatives by linking sales data with each representative's handled orders.

Techniques Used:
- SQL Joins: The project integrates data from multiple tables, including orders, customers, order_items, products, categories, stores, and staffs, using INNER JOIN to merge information from related tables.
- Aggregation: SUM functions compute the total units sold and the revenue generated by each order, providing valuable insights into sales performance.
- Grouping: Data is grouped by order ID, customer, product, store, and sales representative, ensuring accurate and summarized sales metrics. A query along these lines is sketched below.
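As a minimal sketch of the query shape described above (the table names come from the project description, while the column names, such as quantity and list_price, are assumptions):

```sql
-- Illustrative only: column names are assumed, not confirmed by the source.
SELECT
    o.order_id,
    c.first_name || ' ' || c.last_name   AS customer,
    s.store_name,
    st.first_name || ' ' || st.last_name AS sales_rep,
    SUM(oi.quantity)                     AS total_units,
    SUM(oi.quantity * oi.list_price)     AS total_revenue
FROM orders o
INNER JOIN customers   c   ON c.customer_id   = o.customer_id
INNER JOIN order_items oi  ON oi.order_id     = o.order_id
INNER JOIN products    p   ON p.product_id    = oi.product_id
INNER JOIN categories  cat ON cat.category_id = p.category_id
INNER JOIN stores      s   ON s.store_id      = o.store_id
INNER JOIN staffs      st  ON st.staff_id     = o.staff_id
GROUP BY
    o.order_id,
    c.first_name, c.last_name,
    s.store_name,
    st.first_name, st.last_name;
```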
Use Cases:

- Business Decision-Making: The analysis can help businesses identify high-performing products and stores, optimize inventory, and evaluate the impact of sales teams.
- Market Segmentation: Segment customers by geographic location (city/state) and identify patterns in purchasing behavior.
- Sales Strategy Optimization: Provide recommendations to improve sales strategies by analyzing product categories and sales rep performance.
https://creativecommons.org/publicdomain/zero/1.0/
Northwind Database
The Northwind database is a sample database originally created by Microsoft and used as the basis for its tutorials across a variety of database products for decades. It contains sales data for a fictitious company called "Northwind Traders", which imports and exports specialty foods from around the world. Northwind is an excellent tutorial schema for a small-business ERP, covering customers, orders, inventory, purchasing, suppliers, shipping, employees, and single-entry accounting. The Northwind database has since been ported to a variety of non-Microsoft databases, including PostgreSQL.

The Northwind dataset includes sample data for the following.
[Northwind E-R Diagram](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13411583%2Fa52a5bbc3d8842abfdfcfe608b7a8d25%2FNorthwind_E-R_Diagram.png?generation=1718785485874540&alt=media)
Chinook Database
Chinook is a sample database available for SQL Server, Oracle, MySQL, and other systems. It can be created by running a single SQL script. The Chinook database is an alternative to the Northwind database and is ideal for demos and for testing ORM tools targeting single or multiple database servers.

The Chinook data model represents a digital media store, including tables for artists, albums, media tracks, invoices, and customers.

The media-related data was created using real data from an iTunes library. Customer and employee information was created by hand using fictitious names, addresses that can be located on Google Maps, and other well-formatted data (phone, fax, email, etc.). Sales information was generated automatically using random data over a four-year period.

Why the name Chinook? The name of this sample database is a nod to the Northwind database. Chinooks are winds in the interior West of North America, where the Canadian Prairies and Great Plains meet various mountain ranges. Chinooks are most prevalent over southern Alberta in Canada. Chinook is a fitting name for a database intended as an alternative to Northwind.
[Chinook Database Diagram](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13411583%2Fd856e0358e3a572d50f1aba5e171c1c6%2FChinook%20DataBase.png?generation=1718785749657445&alt=media)
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is a structured, multi-table dataset designed to simulate a hospital management system. It is ideal for practicing data analysis, SQL, machine learning, and healthcare analytics.
Dataset Overview
This dataset includes five CSV files:
patients.csv – Patient demographics, contact details, registration info, and insurance data
doctors.csv – Doctor profiles with specializations, experience, and contact information
appointments.csv – Appointment dates, times, visit reasons, and statuses
treatments.csv – Treatment types, descriptions, dates, and associated costs
billing.csv – Billing amounts, payment methods, and status linked to treatments
📁 Files & Column Descriptions
**patients.csv**
Contains patient demographic and registration details.
| Column | Description |
| --- | --- |
| patient_id | Unique ID for each patient |
| first_name | Patient's first name |
| last_name | Patient's last name |
| gender | Gender (M/F) |
| date_of_birth | Date of birth |
| contact_number | Phone number |
| address | Address of the patient |
| registration_date | Date of first registration at the hospital |
| insurance_provider | Insurance company name |
| insurance_number | Policy number |
| email | Email address |
**doctors.csv**
Details about the doctors working in the hospital.
| Column | Description |
| --- | --- |
| doctor_id | Unique ID for each doctor |
| first_name | Doctor's first name |
| last_name | Doctor's last name |
| specialization | Medical field of expertise |
| phone_number | Contact number |
| years_experience | Total years of experience |
| hospital_branch | Branch of the hospital where the doctor is based |
| email | Official email address |
**appointments.csv**
Records of scheduled and completed patient appointments.
| Column | Description |
| --- | --- |
| appointment_id | Unique appointment ID |
| patient_id | ID of the patient |
| doctor_id | ID of the attending doctor |
| appointment_date | Date of the appointment |
| appointment_time | Time of the appointment |
| reason_for_visit | Purpose of visit (e.g., checkup) |
| status | Status (Scheduled, Completed, Cancelled) |
**treatments.csv**
Information about the treatments given during appointments.
| Column | Description |
| --- | --- |
| treatment_id | Unique ID for each treatment |
| appointment_id | Associated appointment ID |
| treatment_type | Type of treatment (e.g., MRI, X-ray) |
| description | Notes or procedure details |
| cost | Cost of treatment |
| treatment_date | Date when treatment was given |
**billing.csv**
Billing and payment details for treatments.
| Column | Description |
| --- | --- |
| bill_id | Unique billing ID |
| patient_id | ID of the billed patient |
| treatment_id | ID of the related treatment |
| bill_date | Date of billing |
| amount | Total amount billed |
| payment_method | Mode of payment (Cash, Card, Insurance) |
| payment_status | Status of payment (Paid, Pending, Failed) |
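Because the files share keys (patient_id, appointment_id, treatment_id), they join cleanly once loaded into a database. As a minimal sketch, assuming each CSV above has been loaded as a table of the same name, the query below totals outstanding charges per patient:

```sql
-- Sketch: assumes each CSV above is loaded as a table with the same name.
SELECT
    p.patient_id,
    p.first_name,
    p.last_name,
    COUNT(b.bill_id) AS pending_bills,
    SUM(b.amount)    AS total_pending
FROM patients p
JOIN billing b ON b.patient_id = p.patient_id
WHERE b.payment_status = 'Pending'
GROUP BY p.patient_id, p.first_name, p.last_name
ORDER BY total_pending DESC;
```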
Possible Use Cases
SQL queries and relational database design
Exploratory data analysis (EDA) and dashboarding
Machine learning projects (e.g., cost prediction, no-show analysis)
Feature engineering and data cleaning practice
End-to-end healthcare analytics workflows
Recommended Tools & Resources
SQL (joins, filters, window functions; see the sketch after this list)
Pandas and Matplotlib/Seaborn for EDA
Scikit-learn for ML models
Pandas Profiling for automated EDA
Plotly for interactive visualizations
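Since the list above mentions window functions, here is a minimal window-function sketch on the billing table, computing a running total of charges per patient over time (again assuming the CSVs are loaded as tables):

```sql
-- Window-function sketch: running total of billed amounts per patient.
SELECT
    patient_id,
    bill_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY patient_id
        ORDER BY bill_date
    ) AS running_total
FROM billing;
```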
Please note:
All data is synthetically generated for educational and project use. No real patient information is included.
If you find this dataset helpful, consider upvoting or sharing your insights by creating a Kaggle notebook.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset Overview

The dataset consists of 26,000 job listings extracted from a Taiwanese job search platform, focusing on software-related careers. Each listing is detailed with various attributes, providing a comprehensive view of the job market in this sector. Here's a breakdown of the dataset columns:
- 職缺類別 (Job Category)
- 職位類別 (Position Category)
- 職位 (Position)
- 縣市 (City/County)
- 地區 (District/Area)
- 供需人數 (應徵人數) (Number of Applicants)
- 公司名稱 (Company Name)
- 職缺名稱 (Job Title)
- 工作內容 (Job Description)
- 職務類別 (Job Type)
- 工作待遇 (Salary)
- 工作性質 (Nature of Work)
- 上班地點 (Work Location)
- 管理責任 (Management Responsibility)
- 上班時段 (Working Hours)
- 需求人數 (Number of Positions)
- 工作經歷 (Work Experience)
- 學歷要求 (Educational Requirements)
- 科系要求 (Departmental Requirements)
- 擅長工具 (Tools Proficiency)
- 工作技能 (Job Skills)
- 其他條件 (Other Conditions)
- 資本額 (Capital Amount)
- 員工人數 (Number of Employees)
- 公司標籤 (Company Tags)

Analytical Insights

- Exploratory Data Analysis: Perform exploratory data analysis using libraries like Pandas and NumPy. Examine trends in job categories, salaries, and educational requirements. Analyze the distribution of jobs across different cities and districts.
- Visualization: Create visual representations of the dataset using Python visualization libraries. Plot job distribution across various sectors or locations. Visualize salary ranges and compare them with educational and experience requirements.
- Practice with SQL or Pandas Queries: Use the dataset to refine SQL query skills or Pandas data manipulation techniques. Execute queries to extract specific information, such as the most in-demand skills or the companies offering the highest salaries; a sketch of one such query follows.
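As a minimal sketch, assuming the listings are loaded into a table named jobs and that the free-text 工作待遇 (Salary) field has been parsed into a numeric salary_ntd column (both are assumptions; neither is part of the raw dataset):

```sql
-- Sketch only: the table name (jobs) and the parsed numeric column
-- (salary_ntd) are assumptions, not part of the raw dataset.
SELECT
    "公司名稱"      AS company,
    COUNT(*)        AS openings,
    AVG(salary_ntd) AS avg_salary_ntd
FROM jobs
GROUP BY "公司名稱"
ORDER BY avg_salary_ntd DESC
LIMIT 10;
```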
NLP Analysis and Tasks for the Software Jobs Dataset

This dataset, encompassing 26,000 job listings from the Taiwanese software industry, is ripe for a variety of Natural Language Processing (NLP) analyses. Below are some recommended NLP tasks and analyses that can be conducted on this dataset.

- Text Classification
  - Job Category Prediction: Train a classification model to predict the job category (職缺類別) using job descriptions (工作內容).
  - Salary Range Classification: Classify jobs into different salary brackets based on their descriptions and titles, helping to identify features associated with higher salaries.
- Sentiment Analysis
  - Company Reputation Analysis: Analyze the sentiment of company tags (公司標籤) to assess the general sentiment or reputation of companies listed in the dataset.
- Topic Modeling
  - Identifying Key Job Requirements: Apply LDA (Latent Dirichlet Allocation) to job descriptions to uncover common themes or required skills in the software sector.
- Named Entity Recognition (NER)
  - Information Extraction: Implement NER to extract specific entities like tools (擅長工具), skills (工作技能), and educational qualifications (學歷要求) from job descriptions.
- Text Summarization
  - Summarizing Job Descriptions: Develop algorithms for generating concise summaries of job descriptions, enabling quick understanding of key points.
- Language Modeling
  - Job Description Generation: Use language models to create realistic job descriptions based on input prompts, assisting in job listing creation or understanding industry language trends.
- Machine Translation (if applicable)
  - Dataset Translation for Global Accessibility: Translate the dataset content into English or other languages for international accessibility, using machine translation models.
- Predictive Analysis
  - Predicting Applicant Volume: Use historical data to forecast the number of applicants (供需人數 (應徵人數)) a job listing might attract based on various factors.

By leveraging these NLP techniques, insightful findings can be extracted from the dataset, beneficial to both job seekers and employers in the software field. This dataset offers a practical opportunity to apply NLP skills in a real-world setting.
CC0
Original Data Source: Taiwan 104.com jobs search JD
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Overview of the Dataset
The UFO sightings dataset contains records of UFO sightings reported globally since 1906. The dataset includes the following columns:
datetime: The date and time of the sighting.
day: The day of the week when the sighting occurred.
city: The city where the sighting was reported.
state: The state or region where the sighting occurred.
country: The country where the sighting was reported.
shape: The shape or form of the UFO observed.
duration (seconds): The duration of the sighting in seconds.
duration (hours/min): The duration of the sighting in hours and minutes.
comments: Additional comments or descriptions provided by the witness.
day_posted: The day the sighting was reported or posted.
date posted: The date the sighting was reported or posted.
latitude: The latitude coordinate of the sighting location.
longitude: The longitude coordinate of the sighting location.
days_count: The number of days between the sighting and when it was posted.
Analysis Process
Data Cleaning and Preparation (Excel):
Removed duplicate entries and handled missing values.
Standardized formats for dates, times, and categorical variables (e.g., shapes, countries).
Calculated additional metrics such as days_count (time between sighting and posting).
Exploratory Data Analysis (SQL):
Aggregated data to analyze trends, such as the number of sightings per country, state, or city.
Calculated average durations of sightings by UFO shape.
Identified the most common UFO shapes and their distribution across countries.
Analyzed temporal trends, such as sightings per day or over time.
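As an illustration of this step, here is a minimal sketch, assuming the cleaned data is loaded into a table named sightings with the "duration (seconds)" column renamed to duration_seconds (both names are assumed stand-ins):

```sql
-- Sketch: table name (sightings) and column duration_seconds are assumed
-- stand-ins for the cleaned columns described above.
SELECT
    shape,
    country,
    COUNT(*)              AS sightings,
    AVG(duration_seconds) AS avg_duration_seconds
FROM sightings
GROUP BY shape, country
ORDER BY sightings DESC;
```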
Visualization (Tableau):
Created interactive dashboards to visualize key insights.
Developed charts such as:
Average Duration of Sightings by Shape: Highlighting which UFO shapes were observed for the longest durations.
UFO Shapes by Country: Showing the distribution of UFO shapes across different countries.
UFO Shapes Total: A global overview of the most commonly reported UFO shapes.
UFO Sightings in All Countries: A map or bar chart showing the number of sightings per country.
UFO Sightings per Day: A time series analysis of sightings over days.
UFO Sightings in the USA: A focused analysis of sightings in the United States, broken down by state or city.
Key Insights and Conclusions
Most Common UFO Shapes:
The most frequently reported UFO shapes include lights, circles, and triangles.
These shapes are consistent across multiple countries, suggesting common patterns in UFO sightings.
Geographical Distribution:
The United States has the highest number of reported UFO sightings, followed by Canada and the United Kingdom.
Within the U.S., states like California, Florida, and Texas report the most sightings.
Temporal Trends:
Sightings have increased significantly since the mid-20th century, with a peak in the 2000s.
Certain days of the week (e.g., weekends) show higher reporting rates, possibly due to increased outdoor activity.
Duration of Sightings:
The average duration of sightings varies by shape. For example, cigar-shaped UFOs tend to be observed for longer periods compared to light or disk shapes.
Most sightings last less than a minute, but some reports describe durations of several hours.
Reporting Delays:
The days_count column reveals that many sightings are reported weeks or even months after they occur, indicating potential delays in witness reporting or data collection.
Global Patterns:
While the U.S. dominates the dataset, other countries show unique patterns in terms of UFO shapes and sighting frequencies.
For example, Australia and Germany report a higher proportion of triangular UFOs compared to other shapes.
Recommendations for Further Analysis
Geospatial Analysis: Use latitude and longitude data to create heatmaps of sightings and identify potential hotspots.
Text Analysis: Analyze the comments column using natural language processing (NLP) to extract common themes or keywords.
Correlation with External Data: Investigate whether UFO sightings correlate with astronomical events, military activity, or other phenomena.
Machine Learning: Build predictive models to identify patterns or classify sightings based on shape, duration, or location.
Conclusion
The UFO sightings dataset provides a fascinating glimpse into global reports of unidentified flying objects. Through careful analysis, I identified key trends in UFO shapes, durations, and geographical distribution. The United States emerges as the epicenter of UFO sightings, with lights and ...
As a data analyst for Zuber, a new ride-sharing company launching in Chicago, our task is to find patterns in the available information. We want to understand passenger preferences and the impact of external factors on rides. Prior to this, we extracted data from SQL databases and performed several data analysis tasks on a separate platform. We have now been provided with three datasets from the previous SQL tasks:

- project_sql_result_01.csv: the name of each taxi company and the number of rides it made on November 15-16, 2017
- project_sql_result_04.csv: the name of the neighborhood where a ride ended and the average number of rides that ended in each neighborhood
- project_sql_result_07.csv: the starting time of each ride, the weather condition at the start, and the duration of the ride
Our step-by-step tasks are as follows:

- Load and check the data
- Exploratory data analysis
- Test the hypothesis: "The average duration of rides from the Loop neighborhood to O'Hare International Airport changes on rainy Saturdays." A sketch of the underlying comparison follows.
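Since the hypothesis concerns Saturday ride durations under different weather, here is a minimal SQL sketch of the comparison, assuming project_sql_result_07.csv is loaded as a table named result_07 with columns start_ts, weather_conditions, and duration_seconds (all assumed names, not confirmed by the source). The formal test itself, such as a two-sample t-test on the two duration samples, would then be run in the analysis environment.

```sql
-- Sketch: table and column names are assumptions for illustration.
-- EXTRACT(DOW ...) uses PostgreSQL numbering, where Saturday = 6.
SELECT
    weather_conditions,
    COUNT(*)              AS rides,
    AVG(duration_seconds) AS avg_duration_seconds
FROM result_07
WHERE EXTRACT(DOW FROM start_ts) = 6
GROUP BY weather_conditions;
```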
This project dataset was provided during my data science bootcamp by Practicum/Yandex. For more information about Practicum by Yandex, please follow the link.