9 datasets found
  1. Axis Insurance
    Statistical Analysis of Business Data - Case Study #2

    • kaggle.com
    zip
    Updated Jan 19, 2023
    Cite
    Hoor T. (2023). Axis Insurance [Dataset]. https://www.kaggle.com/datasets/hoorqasim/axis-insurance
    Explore at:
    Available download formats: zip (16433 bytes)
    Dataset updated
    Jan 19, 2023
    Authors
    Hoor T.
    Description

    Explore the dataset and extract insights from the data. The idea is for you to get comfortable with doing statistical analysis in Python.

    You are expected to do the following:

    • Explore the dataset and extract insights using Exploratory Data Analysis.
    • Prove (or disprove) that the medical claims made by people who smoke are greater than those made by people who don't. [Hint: formulate a hypothesis and test it.]
    • Prove (or disprove), with statistical evidence, that the BMI of females is different from that of males.
    • Is the proportion of smokers significantly different across regions? [Hint: create a contingency table/cross-tab and use stats.chi2_contingency().]
    • Is the mean BMI of women with no children, one child, and two children the same? Explain your answer with statistical evidence.

    Consider a significance level of 0.05 for all tests.
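    As a hedged illustration of the first tests (assuming the usual medical-cost schema with columns age, sex, bmi, children, smoker, region, and charges, and a hypothetical file name):

    import pandas as pd
    from scipy import stats

    # Hypothetical file name; columns assumed: age, sex, bmi, children, smoker, region, charges
    df = pd.read_csv("AxisInsurance.csv")

    # One-sided Welch t-test: do smokers' mean charges exceed non-smokers'?
    smokers = df.loc[df["smoker"] == "yes", "charges"]
    nonsmokers = df.loc[df["smoker"] == "no", "charges"]
    t, p = stats.ttest_ind(smokers, nonsmokers, equal_var=False, alternative="greater")
    print(f"t = {t:.2f}, p = {p:.4f}")  # reject H0 at the 0.05 level if p < 0.05

    # Chi-square test of independence: smoker proportions across regions
    table = pd.crosstab(df["region"], df["smoker"])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")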

  2. Insurance dataset for statistical analysis

    • kaggle.com
    zip
    Updated Sep 21, 2020
    Cite
    Nazeeruddin (2020). Insurance dataset for statistical analysis [Dataset]. https://www.kaggle.com/datasets/nazeernazeer/insurance-dataset-for-statistical-analysis/code
    Explore at:
    Available download formats: zip (16425 bytes)
    Dataset updated
    Sep 21, 2020
    Authors
    Nazeeruddin
    Description

    Data Description: The data at hand contains medical costs of people characterized by certain attributes.

    Domain: Healthcare

    Context: Leveraging customer information is paramount for most businesses. In the case of an insurance company, attributes of customers like the ones below can be crucial in making business decisions, so knowing how to explore such data and generate value from it is an invaluable skill.

    Attribute Information:

    • age: age of the primary beneficiary
    • sex: insurance contractor gender (female, male)
    • bmi: body mass index (kg/m^2), an objective index of body weight relative to height; ideally 18.5 to 24.9
    • children: number of children covered by health insurance / number of dependents
    • smoker: smoking status
    • region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
    • charges: individual medical costs billed by health insurance

    Learning Outcomes:

    • Exploratory Data Analysis
    • Practicing statistics using Python
    • Hypothesis testing
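    As a hedged sketch of the kind of hypothesis testing this dataset supports (a one-way ANOVA on the BMI of women grouped by number of children; file and column names assume the standard schema above):

    import pandas as pd
    from scipy import stats

    df = pd.read_csv("insurance.csv")  # hypothetical file name
    women = df[df["sex"] == "female"]
    groups = [women.loc[women["children"] == k, "bmi"] for k in (0, 1, 2)]
    f_stat, p = stats.f_oneway(*groups)
    print(f"F = {f_stat:.2f}, p = {p:.4f}")  # equal means rejected at the 0.05 level if p < 0.05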

  3. Insurance(HealthCare)

    • kaggle.com
    zip
    Updated Jul 27, 2020
    Cite
    Damini Tiwari (2020). Insurance(HealthCare) [Dataset]. https://www.kaggle.com/datasets/daminitiwari/insurance/discussion
    Explore at:
    Available download formats: zip (16433 bytes)
    Dataset updated
    Jul 27, 2020
    Authors
    Damini Tiwari
    Description

    Dataset

    This dataset was created by Damini Tiwari


  4. Insurance Dataset for Data Engineering Practice

    • kaggle.com
    zip
    Updated Sep 24, 2025
    Cite
    KPOVIESI Olaolouwa Amiche Stéphane (2025). Insurance Dataset for Data Engineering Practice [Dataset]. https://www.kaggle.com/datasets/kpoviesistphane/insurance-dataset-for-data-engineering-practice
    Explore at:
    Available download formats: zip (475362 bytes)
    Dataset updated
    Sep 24, 2025
    Authors
    KPOVIESI Olaolouwa Amiche Stéphane
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Insurance Dataset for Data Engineering Practice

    Overview

    A realistic synthetic French insurance dataset specifically designed for practicing data cleaning, transformation, and analytics with PySpark and other big data tools. This dataset contains intentional data quality issues commonly found in real-world insurance data.

    Dataset Contents

    📊 Three Main Tables:

    • contracts.csv (~15,000 rows) - Insurance contracts with client information
    • claims.csv (~6,000 rows) - Insurance claims with damage and settlement details
    • vehicles.csv (~12,000 rows) - Vehicle information for auto insurance contracts

    🗺️ Geographic Coverage:

    • French cities with realistic postal codes
    • Risk zone classifications (High/Medium/Low)
    • Regional pricing coefficients

    🏷️ Product Types:

    • Auto Insurance (majority)
    • Home Insurance
    • Life Insurance
    • Health Insurance

    🎯 Intentional Data Quality Issues

    Perfect for practicing data cleaning and transformation:

    Date Format Issues:

    • Mixed formats: 2024-01-15, 15/01/2024, 01/15/2024
    • String storage requiring parsing and standardization

    Price Format Inconsistencies:

    • Multiple currency formats: 1250.50€, €1250.50, 1250.50 EUR, $1375.55
    • Missing currency symbols: 1250.50
    • Written formats: 1250.50 euros

    Missing Data Patterns:

    • Strategic missingness in age (8%), CSP (12%), expert_id (20-25%)
    • Realistic patterns based on business logic

    Categorical Inconsistencies:

    • Gender: M, F, Male, Female, empty strings
    • Power units: 150 HP, 150hp, 150 CV, 111 kW, missing values

    Data Type Issues:

    • Numeric values stored as strings
    • Mixed data types requiring casting

    🚀 Perfect for Practicing:

    PySpark Operations (sketched after this section):

    • to_date() and date parsing functions
    • regexp_replace() for price cleaning
    • when().otherwise() conditional logic
    • cast() for data type conversions
    • fillna() and dropna() strategies

    Data Engineering Tasks:

    • ETL pipeline development
    • Data validation and quality checks
    • Join operations across related tables
    • Aggregation with business logic
    • Data standardization workflows

    Analytics & ML:

    • Customer segmentation
    • Claim frequency analysis
    • Premium pricing models
    • Risk assessment by geography
    • Churn prediction
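    As a hedged sketch of the cleaning operations listed above (column names such as start_date, premium, and gender are assumptions; verify against the actual schema after download):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("insurance-cleaning").getOrCreate()
    contracts = spark.read.csv("contracts.csv", header=True)  # hypothetical path

    contracts_clean = (
        contracts
        # Normalize the mixed date formats into one DateType column;
        # strings that fail one pattern fall through to the next, else null.
        .withColumn(
            "start_date_parsed",
            F.coalesce(
                F.to_date("start_date", "yyyy-MM-dd"),
                F.to_date("start_date", "dd/MM/yyyy"),
                F.to_date("start_date", "MM/dd/yyyy"),
            ),
        )
        # Strip currency symbols and words, then cast prices to double.
        .withColumn(
            "premium_clean",
            F.trim(F.regexp_replace("premium", r"[€$]|EUR|euros?", "")).cast("double"),
        )
        # Collapse gender variants; anything else (including empty strings) becomes null.
        .withColumn(
            "gender_clean",
            F.when(F.col("gender").isin("M", "Male"), "M")
             .when(F.col("gender").isin("F", "Female"), "F")
             .otherwise(None),
        )
    )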

    🏢 Business Context

    Realistic insurance business rules implemented:

    • Age-based premium adjustments
    • Geographic risk zone pricing
    • Product-specific claim patterns
    • Seasonal claim distributions
    • Client lifecycle status transitions

    💡 Use Cases:

    • Data Engineering Bootcamps: Hands-on PySpark practice
    • SQL Training: Complex joins and aggregations
    • Data Science Projects: End-to-end ML pipeline development
    • Business Intelligence: Dashboard and reporting practice
    • Data Quality Workshops: Cleaning and validation techniques

    🔧 Tools Compatibility:

    • Apache Spark / PySpark
    • Pandas / Python
    • SQL databases
    • Databricks
    • Google Cloud Dataflow
    • AWS Glue

    📈 Difficulty Level:

    Intermediate - Suitable for learners with basic Python/SQL knowledge ready to tackle real-world data challenges.

    Generated with realistic French business context and intentional quality issues for educational purposes. All data is synthetic and does not represent real individuals or companies.

  5. Datasets for manuscript "A data engineering framework for chemical flow...

    • catalog.data.gov
    • gimi9.com
    Updated Nov 7, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-a-data-engineering-framework-for-chemical-flow-analysis-of-industr
    Explore at:
    Dataset updated
    Nov 7, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The EPA GitHub repository PAU4Chem, as described in its README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flow transfers, estimating releases, and identifying potential occupational exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder "datasets" contains the outputs for each framework step, and Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories.

    The EPA GitHub repository PAU_case_study, as described in its readme.md entry, contains the Python scripts to run the manuscript's case study: designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and for tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering across different publicly available databases: the properties of chemicals were gathered using the GitHub repository Properties_Scraper, while the PAU dataset was built using the repository PAU4Chem.

    Finally, the EPA GitHub repository Properties_Scraper contains a Python script to gather, in bulk, information about exposure limits and physical properties from different publicly available sources: EPA, NOAA, OSHA, and the Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). All three GitHub repositories describe the Python libraries required to run their code, how to use them, the output files produced by the Python script modules, and the corresponding EPA disclaimer.

    This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).

  6. Insurance_claims

    • kaggle.com
    • data.mendeley.com
    zip
    Updated Oct 19, 2025
    Cite
    Miannotti (2025). Insurance_claims [Dataset]. https://www.kaggle.com/datasets/mian91218/insurance-claims
    Explore at:
    Available download formats: zip (68984 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    Miannotti
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    AQQAD, ABDELRAHIM (2023), "insurance_claims", Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2

    https://data.mendeley.com/datasets/992mh7dk9y/2

    Latest version: Version 2, published 22 Aug 2023. DOI: 10.17632/992mh7dk9y.2

    Data Acquisition:
    • Obtain the dataset titled "Insurance_claims" from the Mendeley repository: https://data.mendeley.com/drafts/992mh7dk9y
    • Download and store the dataset locally for easy access during subsequent steps.

    Data Loading & Initial Exploration:
    • Use Python's Pandas library to load the dataset into a DataFrame. Code used:

    import pandas as pd

    # Load the dataset file
    insurance_df = pd.read_csv('insurance_claims.csv')

    • Inspect the initial rows, data types, and summary statistics to get an understanding of the dataset's structure.
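    A minimal sketch of that inspection step, continuing from insurance_df above:

    print(insurance_df.head())      # first rows
    insurance_df.info()             # data types and non-null counts (prints directly)
    print(insurance_df.describe())  # summary statistics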

    Data Cleaning & Pre-processing:
    • Handle missing values, if any; strategies may include imputation or deletion based on the nature of the missing data.
    • Identify and handle outliers. In this research, outliers in the 'umbrella_limit' column were addressed in particular.
    • Normalize or standardize features if necessary.
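    One possible outlier treatment, sketched with the IQR rule (the study may have used a different rule; this is illustrative only):

    # Clip 'umbrella_limit' to the usual 1.5 * IQR fences.
    q1, q3 = insurance_df["umbrella_limit"].quantile([0.25, 0.75])
    iqr = q3 - q1
    insurance_df["umbrella_limit"] = insurance_df["umbrella_limit"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)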

    Exploratory Data Analysis (EDA):
    • Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration.
    • Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'.
    • Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.
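    For example, a single Seaborn call shows the class balance of the target:

    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.countplot(data=insurance_df, x="fraud_reported")  # 'Y' vs 'N' counts
    plt.show()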

    Feature Engineering & Selection:
    • Create new features or transform existing ones to improve model performance.
    • Use techniques like Recursive Feature Elimination with cross-validation (RFECV) to identify and retain only the most informative features.
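    A hedged RFECV sketch (a numeric-only feature matrix is assumed for brevity; a real pipeline would also encode categorical columns):

    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression

    X = insurance_df.select_dtypes("number").fillna(0)
    y = (insurance_df["fraud_reported"] == "Y").astype(int)

    selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5, scoring="f1")
    selector.fit(X, y)
    print("Selected features:", list(X.columns[selector.support_]))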

    Modeling:
    • Split the dataset into training and test sets to assess the model's generalizability.
    • Implement machine learning algorithms such as Support Vector Machine, Random Forest, and a Voting Classifier using libraries like Scikit-learn.
    • Handle class imbalance using methods like the Synthetic Minority Over-sampling Technique (SMOTE).
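    A minimal modeling sketch with SMOTE, continuing from X and y above (Random Forest stands in for the full model comparison):

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from imblearn.over_sampling import SMOTE

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    # Oversample only the training fold to avoid leaking into the test set.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)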

    Model Evaluation:
    • Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC, and the confusion matrix.
    • Fine-tune the models based on the results; hyperparameter tuning can be performed with techniques like Grid Search or Random Search.
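    Continuing the sketch, the listed metrics are available directly from Scikit-learn:

    from sklearn.metrics import classification_report, roc_auc_score

    print(classification_report(y_test, clf.predict(X_test)))
    print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))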

    Model Interpretation:
    • Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.
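    A short SHAP sketch, reusing clf and X_test from the modeling sketch above (the shape of shap_values varies across SHAP versions for classifiers):

    import shap

    explainer = shap.TreeExplainer(clf)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test)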

    Deployment & Prediction:
    • Use the best-performing model to make predictions on unseen data.
    • To deploy the model in a real-world scenario, serialize the trained model into a deployable format (e.g., using libraries like joblib or pickle).
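    For example, with joblib:

    import joblib

    joblib.dump(clf, "fraud_model.joblib")      # persist the trained model
    loaded = joblib.load("fraud_model.joblib")  # reload it for later predictions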

    Software & Tools:
    • Programming language: Python (run in Google Colab)
    • Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP
    • Environment: Jupyter Notebook or any Python IDE

  7. UCI Automobile Dataset

    • kaggle.com
    Updated Feb 12, 2023
    Cite
    Otrivedi (2023). UCI Automobile Dataset [Dataset]. https://www.kaggle.com/datasets/otrivedi/automobile-data/suggestions
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Otrivedi
    Description

    In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

    This dataset consists of data from the 1985 Ward's Automotive Yearbook. Here are the sources:

    1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
    2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
    3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037

    Number of Instances: 398
    Number of Attributes: 9, including the class attribute

    Attribute Information:

    • mpg: continuous
    • cylinders: multi-valued discrete
    • displacement: continuous
    • horsepower: continuous
    • weight: continuous
    • acceleration: continuous
    • model year: multi-valued discrete
    • origin: multi-valued discrete
    • car name: string (unique for each instance)

    This data set consists of three types of entities:

    I - The specification of an auto in terms of various characteristics

    II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with their price; then, if a car is riskier (or less risky), the symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".

    III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.

    The analysis is divided into two parts (a short pandas sketch follows the lists):

    Data Wrangling

    1. Pre-processing data in Python
    2. Dealing with missing values
    3. Data formatting
    4. Data normalization
    5. Binning

    Exploratory Data Analysis

    1. Descriptive statistics
    2. Groupby
    3. Analysis of variance
    4. Correlation
    5. Correlation statistics
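    A hedged pandas sketch of the wrangling steps (column positions follow the UCI attribute documentation; the file has no header row and marks missing values with '?'):

    import pandas as pd

    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
    df = pd.read_csv(url, header=None, na_values="?")
    df = df.rename(columns={0: "symboling", 1: "normalized_losses", 25: "price"})

    print(df.isna().sum().head())  # missing-value counts per column

    # Binning: split price into three equal-width bands, then group
    df["price_band"] = pd.cut(df["price"], bins=3, labels=["low", "medium", "high"])
    print(df.groupby("price_band", observed=True)["normalized_losses"].mean())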

    Acknowledgment
    Dataset: UCI Machine Learning Repository
    Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

  8. Datasets for "Excessive rainfall leads to maize yield loss of a comparable...

    • figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Yan Li; Kaiyu Guan (2023). Datasets for "Excessive rainfall leads to maize yield loss of a comparable magnitude to extreme drought in the United States" [Dataset]. http://doi.org/10.6084/m9.figshare.7581473.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yan Li; Kaiyu Guan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    United States
    Description

    Updated on 2021/07/01: added Python code for data processing of the NASS yield data, PRISM climate data, RMA insurance data, and AgMIP yield simulation data, and for the related geospatial analysis.

    This repository holds the datasets and code for "Excessive rainfall leads to maize yield loss of a comparable magnitude to extreme drought in the United States" (Li et al. 2019 GCB). The datasets include maize yield, climate data, maize yield loss from crop insurance data in the US, etc. They can reproduce the results of the paper and can also be reused to explore other topics. Please cite Li et al. (2019) or acknowledge this dataset (DOI: 10.6084/m9.figshare.7581473) when using it.

  9. CA Hospital Dataset – Q1 2025

    • kaggle.com
    Updated Aug 9, 2025
    Cite
    Rajkumar K P (2025). CA Hospital Dataset – Q1 2025 [Dataset]. https://www.kaggle.com/datasets/rajkumarpadmanabhan/ca-hospital-dataset-q1-2025
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rajkumar K P
    License

    CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This synthetic dataset simulates the end-to-end operations of a California-based hospital for Q1 2025. It includes over 126,000 rows across 9 fully integrated tables that capture patient visits, clinical procedures, diagnoses, lab tests, medication prescriptions, provider details, billing, claims, and denials — designed for data analytics, machine learning, and healthcare research.

    📦 Tables Included:

    • patients.csv – Patient demographics, insurance, DOB, gender
    • encounters.csv – Admission/discharge details, visit types, departments
    • diagnoses.csv – ICD-10 diagnosis codes linked to encounters
    • procedures.csv – CPT/ICD-10-PCS procedure codes per patient
    • medications.csv – Drug names, dosages, prescription data
    • lab_tests.csv – Test names, result values, normal ranges
    • claims_and_billing.csv – Financial charges, insurance claims, payments
    • providers.csv – Doctors, specializations, provider roles
    • denials.csv – Reasons for claim denial, status, appeal info
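    As a hedged illustration of joining these tables (key and column names such as encounter_id, claim_id, and denial_reason are assumptions; verify against the actual schema):

    import pandas as pd

    encounters = pd.read_csv("encounters.csv")
    claims = pd.read_csv("claims_and_billing.csv")
    denials = pd.read_csv("denials.csv")

    merged = (encounters
              .merge(claims, on="encounter_id", how="left")
              .merge(denials, on="claim_id", how="left"))
    print("Share of encounters with a denial record:",
          merged["denial_reason"].notna().mean())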

    This dataset was custom-built to reflect real-world healthcare challenges, including:

    • Messy and missing data (for cleaning exercises)
    • Insurance claim workflows and denial patterns
    • Analysis of repeat admissions and chronic disease trends
    • Medication brand usage, cost patterns, and outcomes

    🧠 Ideal For:

    • Healthcare data science projects
    • Revenue Cycle Management (RCM) analytics
    • Power BI & Tableau dashboards
    • Machine learning modeling (readmission, denial prediction, etc.)
    • Python/SQL data cleaning practice

    This dataset is completely synthetic and safe for public use. It was generated using custom rules, distributions, and logic reflective of real hospital operations.
