Explore the dataset and extract insights from the data. The idea is for you to get comfortable with doing statistical analysis in Python.
You are expected to do the following:
Consider a significance level of 0.05 for all tests.
Data Description: The data at hand contains medical costs of people characterized by certain attributes.

Domain: Healthcare

Context: Leveraging customer information is paramount for most businesses. In the case of an insurance company, attributes of customers like the ones mentioned below can be crucial in making business decisions. Hence, knowing how to explore and generate value out of such data is an invaluable skill to have.

Attribute Information:
age: age of the primary beneficiary
sex: gender of the insurance contractor (female, male)
bmi: body mass index (kg/m^2), an objective index of body weight relative to height; the ideal range is 18.5 to 24.9
children: number of children covered by health insurance / number of dependents
smoker: smoking status of the beneficiary
region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
charges: individual medical costs billed by health insurance

Learning Outcomes: Exploratory Data Analysis, practicing statistics using Python, hypothesis testing
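As one concrete illustration of a hypothesis test at the 0.05 level, the sketch below compares mean charges for smokers versus non-smokers with a two-sample t-test. The file name insurance.csv and the yes/no coding of the smoker column are assumptions based on the attribute list above.

```python
import pandas as pd
from scipy import stats

# Assumed file and column coding, for illustration only.
df = pd.read_csv("insurance.csv")

smokers = df.loc[df["smoker"] == "yes", "charges"]
non_smokers = df.loc[df["smoker"] == "no", "charges"]

# Welch's two-sample t-test: do mean charges differ between the two groups?
t_stat, p_value = stats.ttest_ind(smokers, non_smokers, equal_var=False)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0",
      "at the 0.05 significance level")
```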
This dataset was created by Damini Tiwari.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A realistic synthetic French insurance dataset specifically designed for practicing data cleaning, transformation, and analytics with PySpark and other big data tools. This dataset contains intentional data quality issues commonly found in real-world insurance data.
Perfect for practicing data cleaning and transformation:
- Mixed date formats: 2024-01-15, 15/01/2024, 01/15/2024
- Mixed price formats: 1250.50€, €1250.50, 1250.50 EUR, $1375.55, 1250.50, 1250.50 euros
- Inconsistent gender encodings: M, F, Male, Female, empty strings
- Inconsistent horsepower units: 150 HP, 150hp, 150 CV, 111 kW, missing values

PySpark functions to practice (see the sketch below): to_date() and date parsing functions, regexp_replace() for price cleaning, when().otherwise() conditional logic, cast() for data type conversions, fillna() and dropna() strategies.

Realistic insurance business rules implemented:
- Age-based premium adjustments
- Geographic risk zone pricing
- Product-specific claim patterns
- Seasonal claim distributions
- Client lifecycle status transitions
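Below is a minimal PySpark sketch of how these cleaning functions might be combined. The input file name and column names (insurance_contracts.csv, contract_date, premium, gender) are assumptions for illustration, not the dataset's documented schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("insurance_cleaning").getOrCreate()

# Assumed input path and column names, for illustration only.
raw_df = spark.read.csv("insurance_contracts.csv", header=True)

clean_df = (
    raw_df
    # Parse the mixed date formats; to_date() yields null when a format does not match.
    .withColumn(
        "contract_date",
        F.coalesce(
            F.to_date("contract_date", "yyyy-MM-dd"),
            F.to_date("contract_date", "dd/MM/yyyy"),
            F.to_date("contract_date", "MM/dd/yyyy"),
        ),
    )
    # Strip currency symbols and text, then cast the price to a numeric type.
    .withColumn(
        "premium",
        F.regexp_replace("premium", "[^0-9.]", "").cast("double"),
    )
    # Normalize gender encodings with conditional logic.
    .withColumn(
        "gender",
        F.when(F.col("gender").isin("M", "Male"), "M")
         .when(F.col("gender").isin("F", "Female"), "F")
         .otherwise(None),
    )
    # One possible missing-value strategy: fill unparsed premiums with 0.0.
    .fillna({"premium": 0.0})
)
```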
Intermediate - Suitable for learners with basic Python/SQL knowledge ready to tackle real-world data challenges.
Generated with realistic French business context and intentional quality issues for educational purposes. All data is synthetic and does not represent real individuals or companies.
The EPA GitHub repository PAU4Chem, as described in its README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flow transfers, estimating releases, and identifying potential occupational exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step, and Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study, as described in its readme.md file, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly available databases: the properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset was built using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to gather information about exposure limits and physical properties from different publicly available sources: EPA, NOAA, OSHA, and the Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). All GitHub repositories describe the Python libraries required to run their code, how to use them, the output files obtained after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
AQQAD, ABDELRAHIM (2023), “insurance_claims ”, Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2
https://data.mendeley.com/datasets/992mh7dk9y/2
Latest version: Version 2, published 22 Aug 2023. DOI: 10.17632/992mh7dk9y.2
Data Acquisition: - Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://data.mendeley.com/drafts/992mh7dk9y - Download and store the dataset locally for easy access during subsequent steps.
Data Loading & Initial Exploration: - Use Python's Pandas library to load the dataset into a DataFrame. Code used (Python):

import pandas as pd  # required for read_csv
insurance_df = pd.read_csv('insurance_claims.csv')
Data Cleaning & Pre-processing: - Handle missing values, if any. Strategies may include imputation or deletion based on the nature of the missing data. - Identify and handle outliers. In this research, particularly, outliers in the 'umbrella_limit' column were addressed. - Normalize or standardize features if necessary.
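A minimal sketch of one possible cleaning pass is shown below; the median imputation and the IQR-based clipping of 'umbrella_limit' are illustrative choices, not necessarily the exact strategy used in the study.

```python
import pandas as pd

insurance_df = pd.read_csv("insurance_claims.csv")

# Drop columns that are entirely empty and fill remaining numeric gaps with the median
# (an illustrative imputation strategy; adapt to the actual pattern of missingness).
insurance_df = insurance_df.dropna(axis=1, how="all")
insurance_df = insurance_df.fillna(insurance_df.median(numeric_only=True))

# Address outliers in 'umbrella_limit' by clipping to the 1.5 * IQR fences.
q1, q3 = insurance_df["umbrella_limit"].quantile([0.25, 0.75])
iqr = q3 - q1
insurance_df["umbrella_limit"] = insurance_df["umbrella_limit"].clip(
    lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr
)
```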
Exploratory Data Analysis (EDA): - Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration. - Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'. - Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.
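For example, a couple of Seaborn plots along these lines can show how features separate by the target; the column name 'total_claim_amount' is assumed for illustration, while 'fraud_reported' comes from the dataset description.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# insurance_df comes from the loading/cleaning steps above.
# Distribution of a numeric feature split by the target label.
sns.histplot(data=insurance_df, x="total_claim_amount", hue="fraud_reported", kde=True)
plt.title("Claim amount by fraud label")
plt.show()

# Correlation heatmap of the numeric features.
corr = insurance_df.select_dtypes("number").corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```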
Feature Engineering & Selection: - Create or transform existing features to improve model performance. - Use techniques like Recursive Feature Elimination (RFECV) to identify and retain only the most informative features.
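A hedged sketch of RFECV with a random-forest estimator is given below; the one-hot encoding of categoricals and the Y/N label mapping are assumptions about the raw columns.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# insurance_df from the steps above; encoding choices here are illustrative.
X = pd.get_dummies(insurance_df.drop(columns=["fraud_reported"]))
y = insurance_df["fraud_reported"].map({"Y": 1, "N": 0})

# Recursive feature elimination with cross-validation keeps only informative features.
selector = RFECV(
    estimator=RandomForestClassifier(random_state=42),
    step=1,
    cv=5,
    scoring="f1",
)
selector.fit(X, y)
selected = X.columns[selector.support_]
print(f"{selector.n_features_} features retained:", list(selected))
```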
Modeling: - Split the dataset into training and test sets to ensure the model's generalizability. - Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn. - Handle class imbalance issues using methods like Synthetic Minority Over-sampling Technique (SMOTE).
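One way to wire these pieces together is sketched below, with SMOTE applied to the training split only; hyperparameters are left at defaults, and X, y are the encoded features and labels from the previous step.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority (fraud) class on the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Soft-voting ensemble of an SVM and a random forest.
clf = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True, random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    voting="soft",
)
clf.fit(X_res, y_res)
```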
Model Evaluation: - Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and confusion matrix. - Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.
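A sketch of the evaluation metrics and a small grid search follows; the parameter grid is illustrative rather than the one used in the study.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV

# clf, X_test, y_test, X_res, y_res come from the modeling step above.
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))   # precision, recall, F1-score
print(confusion_matrix(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Example grid search over random-forest hyperparameters (grid values are illustrative).
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_res, y_res)
print("Best params:", grid.best_params_)
```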
Model Interpretation: - Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.
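For instance, SHAP values for the random-forest member of the ensemble could be summarized as below; this assumes the fitted VotingClassifier (clf) and X_test from the modeling step, and the exact shape of the returned SHAP values can vary with the shap version.

```python
import shap

# Tree-based explainer on the random-forest member of the fitted ensemble.
rf_model = clf.named_estimators_["rf"]
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)

# Global feature-importance summary plot.
shap.summary_plot(shap_values, X_test)
```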
Deployment & Prediction: - Utilize the best-performing model to make predictions on unseen data. - If the intention is to deploy the model in a real-world scenario, convert the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).
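A minimal persistence sketch with joblib; the file name is arbitrary.

```python
import joblib

# Persist the trained model, then reload it for inference on new claims.
joblib.dump(clf, "fraud_model.joblib")

loaded = joblib.load("fraud_model.joblib")
predictions = loaded.predict(X_test)  # replace X_test with new, unseen data
```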
Software & Tools: - Programming Language: Python (run in Google Colab) - Libraries: Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP. - Environment: Google Colab, Jupyter Notebook, or any Python IDE.
In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
This dataset consists of data from the 1985 Ward's Automotive Yearbook. Here are the sources:
1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037
Number of Instances: 398
Number of Attributes: 9, including the class attribute
Attribute Information:
mpg: continuous
cylinders: multi-valued discrete
displacement: continuous
horsepower: continuous
weight: continuous
acceleration: continuous
model year: multi-valued discrete
origin: multi-valued discrete
car name: string (unique for each instance)
This data set consists of three types of entities:
I - The specification of an auto in terms of various characteristics
II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky than that, the symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".
III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.
The analysis is divided into two parts (a short worked example follows the list below):
Data Wrangling
Exploratory Data Analysis
Descriptive statistics
Groupby
Analysis of variance
Correlation
Correlation stats
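A brief sketch of these steps on the imports-85 file is given below. The raw file has no header row, so column names follow the imports-85 documentation, and the specific grouping and ANOVA comparison are illustrative choices.

```python
import pandas as pd
from scipy import stats

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "autos/imports-85.data")

# Column names per the imports-85 documentation; "?" marks missing values.
cols = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
        "num_doors", "body_style", "drive_wheels", "engine_location",
        "wheel_base", "length", "width", "height", "curb_weight", "engine_type",
        "num_cylinders", "engine_size", "fuel_system", "bore", "stroke",
        "compression_ratio", "horsepower", "peak_rpm", "city_mpg",
        "highway_mpg", "price"]
df = pd.read_csv(url, names=cols, na_values="?")

# Descriptive statistics and groupby
print(df["price"].describe())
print(df.groupby("body_style")["price"].mean())

# One-way ANOVA: does mean price differ across drive-wheel types?
groups = [g["price"].dropna() for _, g in df.groupby("drive_wheels")]
f_stat, p_val = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4g}")

# Correlation between engine size and price
print(df[["engine_size", "price"]].corr())
```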
Acknowledgment: UCI Machine Learning Repository. Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Updated on 2021/07/01: added Python code for processing the NASS yield data, PRISM climate data, RMA insurance data, AgMIP yield simulation data, and the related geospatial analysis.

This repository holds the datasets and code for "Excessive rainfall leads to maize yield loss of a comparable magnitude to extreme drought in the United States" (Li et al. 2019 GCB). The datasets include maize yield, climate data, maize yield loss from crop insurance data in the US, etc. They can reproduce the results of the paper and can also be reused to explore other topics. Please cite Li et al. (2019) or acknowledge this dataset (DOI: 10.6084/m9.figshare.7581473) when using it.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates the end-to-end operations of a California-based hospital for Q1 2025. It includes over 126,000 rows across 9 fully integrated tables that capture patient visits, clinical procedures, diagnoses, lab tests, medication prescriptions, provider details, billing, claims, and denials — designed for data analytics, machine learning, and healthcare research.
📦 Tables Included:

patients.csv – Patient demographics, insurance, DOB, gender
encounters.csv – Admission/discharge details, visit types, departments
diagnoses.csv – ICD-10 diagnosis codes linked to encounters
procedures.csv – CPT/ICD-10-PCS procedure codes per patient
medications.csv – Drug names, dosages, prescription data
lab_tests.csv – Test names, result values, normal ranges
claims_and_billing.csv – Financial charges, insurance claims, payments
providers.csv – Doctors, specializations, provider roles
denials.csv – Reasons for claim denial, status, appeal info
This dataset was custom-built to reflect real-world healthcare challenges including:
Messy and missing data (for cleaning exercises)
Insurance claim workflows and denial patterns
Analysis of repeat admissions and chronic disease trends
Medication brand usage, cost patterns, and outcomes
🧠 Ideal For:

Healthcare Data Science Projects
Revenue Cycle Management (RCM) analytics
Power BI & Tableau Dashboards
Machine Learning modeling (readmission, denial prediction, etc.)
Python/SQL Data Cleaning Practice
This dataset is completely synthetic and safe for public use. It was generated using custom rules, distributions, and logic reflective of real hospital operations.
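As one starting point, the sketch below flags 30-day readmissions from encounters.csv. The column names (patient_id, admission_date, discharge_date) are assumptions, since the table list above does not spell out the schema.

```python
import pandas as pd

# Assumed column names, for illustration only.
encounters = pd.read_csv("encounters.csv",
                         parse_dates=["admission_date", "discharge_date"])

# Flag 30-day readmissions: a new admission within 30 days of the previous discharge.
encounters = encounters.sort_values(["patient_id", "admission_date"])
encounters["prev_discharge"] = encounters.groupby("patient_id")["discharge_date"].shift(1)
encounters["readmitted_30d"] = (
    (encounters["admission_date"] - encounters["prev_discharge"]).dt.days <= 30
)

print(encounters["readmitted_30d"].value_counts(dropna=False))
```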