Explore the dataset and extract insights from the data. The idea is for you to get comfortable with doing statistical analysis in Python.
You are expected to do the following:
Consider a significance level of 0.05 for all tests.
Data Description: The data at hand contains medical costs of people characterized by certain attributes.

Domain: Healthcare

Context: Leveraging customer information is paramount for most businesses. In the case of an insurance company, attributes of customers like the ones mentioned below can be crucial in making business decisions. Hence, knowing how to explore and generate value out of such data is an invaluable skill to have.

Attribute Information:
age: age of the primary beneficiary
sex: gender of the insurance contractor (female, male)
bmi: body mass index (kg/m^2), an objective index of body weight relative to height; the ideal range is 18.5 to 24.9
children: number of children covered by health insurance / number of dependents
smoker: smoking status of the beneficiary
region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
charges: individual medical costs billed by health insurance

Learning Outcomes: Exploratory Data Analysis, practicing statistics using Python, hypothesis testing
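As one concrete illustration of a hypothesis test at the 0.05 level, the sketch below compares mean charges for smokers versus non-smokers with a two-sample t-test. The file name insurance.csv and the yes/no coding of the smoker column are assumptions based on the attribute list above.

```python
import pandas as pd
from scipy import stats

# Assumed file and column coding, for illustration only.
df = pd.read_csv("insurance.csv")

smokers = df.loc[df["smoker"] == "yes", "charges"]
non_smokers = df.loc[df["smoker"] == "no", "charges"]

# Welch's two-sample t-test: do mean charges differ between the two groups?
t_stat, p_value = stats.ttest_ind(smokers, non_smokers, equal_var=False)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0",
      "at the 0.05 significance level")
```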
This dataset was created by Damini Tiwari.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A realistic synthetic French insurance dataset specifically designed for practicing data cleaning, transformation, and analytics with PySpark and other big data tools. This dataset contains intentional data quality issues commonly found in real-world insurance data.
Perfect for practicing data cleaning and transformation:
- Mixed date formats: 2024-01-15, 15/01/2024, 01/15/2024
- Mixed price formats: 1250.50€, €1250.50, 1250.50 EUR, $1375.55, 1250.50, 1250.50 euros
- Inconsistent gender encodings: M, F, Male, Female, empty strings
- Inconsistent horsepower units: 150 HP, 150hp, 150 CV, 111 kW, missing values

PySpark functions to practice (see the sketch below): to_date() and date parsing functions, regexp_replace() for price cleaning, when().otherwise() conditional logic, cast() for data type conversions, fillna() and dropna() strategies.

Realistic insurance business rules implemented:
- Age-based premium adjustments
- Geographic risk zone pricing
- Product-specific claim patterns
- Seasonal claim distributions
- Client lifecycle status transitions
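Below is a minimal PySpark sketch of how these cleaning functions might be combined. The input file name and column names (insurance_contracts.csv, contract_date, premium, gender) are assumptions for illustration, not the dataset's documented schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("insurance_cleaning").getOrCreate()

# Assumed input path and column names, for illustration only.
raw_df = spark.read.csv("insurance_contracts.csv", header=True)

clean_df = (
    raw_df
    # Parse the mixed date formats; to_date() yields null when a format does not match.
    .withColumn(
        "contract_date",
        F.coalesce(
            F.to_date("contract_date", "yyyy-MM-dd"),
            F.to_date("contract_date", "dd/MM/yyyy"),
            F.to_date("contract_date", "MM/dd/yyyy"),
        ),
    )
    # Strip currency symbols and text, then cast the price to a numeric type.
    .withColumn(
        "premium",
        F.regexp_replace("premium", "[^0-9.]", "").cast("double"),
    )
    # Normalize gender encodings with conditional logic.
    .withColumn(
        "gender",
        F.when(F.col("gender").isin("M", "Male"), "M")
         .when(F.col("gender").isin("F", "Female"), "F")
         .otherwise(None),
    )
    # One possible missing-value strategy: fill unparsed premiums with 0.0.
    .fillna({"premium": 0.0})
)
```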
Intermediate - Suitable for learners with basic Python/SQL knowledge ready to tackle real-world data challenges.
Generated with realistic French business context and intentional quality issues for educational purposes. All data is synthetic and does not represent real individuals or companies.
The EPA GitHub repository PAU4Chem, as described in its README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flow transfers, estimating releases, and identifying potential occupational exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step, and Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study, as described in its readme.md file, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly available databases: the properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset was built using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to gather information about exposure limits and physical properties from different publicly available sources: EPA, NOAA, OSHA, and the Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). All GitHub repositories describe the Python libraries required to run their code, how to use them, the output files obtained after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
AQQAD, ABDELRAHIM (2023), “insurance_claims ”, Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2
https://data.mendeley.com/datasets/992mh7dk9y/2
Latest version: Version 2, published 22 Aug 2023. DOI: 10.17632/992mh7dk9y.2
Data Acquisition: - Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://data.mendeley.com/drafts/992mh7dk9y - Download and store the dataset locally for easy access during subsequent steps.
Data Loading & Initial Exploration: - Use Python's Pandas library to load the dataset into a DataFrame. Code used (Python):

import pandas as pd  # required for read_csv
insurance_df = pd.read_csv('insurance_claims.csv')
Data Cleaning & Pre-processing: - Handle missing values, if any. Strategies may include imputation or deletion based on the nature of the missing data. - Identify and handle outliers. In this research, particularly, outliers in the 'umbrella_limit' column were addressed. - Normalize or standardize features if necessary.
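A minimal sketch of one possible cleaning pass is shown below; the median imputation and the IQR-based clipping of 'umbrella_limit' are illustrative choices, not necessarily the exact strategy used in the study.

```python
import pandas as pd

insurance_df = pd.read_csv("insurance_claims.csv")

# Drop columns that are entirely empty and fill remaining numeric gaps with the median
# (an illustrative imputation strategy; adapt to the actual pattern of missingness).
insurance_df = insurance_df.dropna(axis=1, how="all")
insurance_df = insurance_df.fillna(insurance_df.median(numeric_only=True))

# Address outliers in 'umbrella_limit' by clipping to the 1.5 * IQR fences.
q1, q3 = insurance_df["umbrella_limit"].quantile([0.25, 0.75])
iqr = q3 - q1
insurance_df["umbrella_limit"] = insurance_df["umbrella_limit"].clip(
    lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr
)
```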
Exploratory Data Analysis (EDA): - Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration. - Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'. - Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.
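For example, a couple of Seaborn plots along these lines can show how features separate by the target; the column name 'total_claim_amount' is assumed for illustration, while 'fraud_reported' comes from the dataset description.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# insurance_df comes from the loading/cleaning steps above.
# Distribution of a numeric feature split by the target label.
sns.histplot(data=insurance_df, x="total_claim_amount", hue="fraud_reported", kde=True)
plt.title("Claim amount by fraud label")
plt.show()

# Correlation heatmap of the numeric features.
corr = insurance_df.select_dtypes("number").corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```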
Feature Engineering & Selection: - Create or transform existing features to improve model performance. - Use techniques like Recursive Feature Elimination (RFECV) to identify and retain only the most informative features.
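A hedged sketch of RFECV with a random-forest estimator is given below; the one-hot encoding of categoricals and the Y/N label mapping are assumptions about the raw columns.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# insurance_df from the steps above; encoding choices here are illustrative.
X = pd.get_dummies(insurance_df.drop(columns=["fraud_reported"]))
y = insurance_df["fraud_reported"].map({"Y": 1, "N": 0})

# Recursive feature elimination with cross-validation keeps only informative features.
selector = RFECV(
    estimator=RandomForestClassifier(random_state=42),
    step=1,
    cv=5,
    scoring="f1",
)
selector.fit(X, y)
selected = X.columns[selector.support_]
print(f"{selector.n_features_} features retained:", list(selected))
```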
Modeling: - Split the dataset into training and test sets to ensure the model's generalizability. - Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn. - Handle class imbalance issues using methods like Synthetic Minority Over-sampling Technique (SMOTE).
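One way to wire these pieces together is sketched below, with SMOTE applied to the training split only; hyperparameters are left at defaults, and X, y are the encoded features and labels from the previous step.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority (fraud) class on the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Soft-voting ensemble of an SVM and a random forest.
clf = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True, random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    voting="soft",
)
clf.fit(X_res, y_res)
```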
Model Evaluation: - Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and confusion matrix. - Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.
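A sketch of the evaluation metrics and a small grid search follows; the parameter grid is illustrative rather than the one used in the study.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV

# clf, X_test, y_test, X_res, y_res come from the modeling step above.
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))   # precision, recall, F1-score
print(confusion_matrix(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Example grid search over random-forest hyperparameters (grid values are illustrative).
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_res, y_res)
print("Best params:", grid.best_params_)
```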
Model Interpretation: - Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.
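For instance, SHAP values for the random-forest member of the ensemble could be summarized as below; this assumes the fitted VotingClassifier (clf) and X_test from the modeling step, and the exact shape of the returned SHAP values can vary with the shap version.

```python
import shap

# Tree-based explainer on the random-forest member of the fitted ensemble.
rf_model = clf.named_estimators_["rf"]
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)

# Global feature-importance summary plot.
shap.summary_plot(shap_values, X_test)
```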
Deployment & Prediction: - Utilize the best-performing model to make predictions on unseen data. - If the intention is to deploy the model in a real-world scenario, convert the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).
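A minimal persistence sketch with joblib; the file name is arbitrary.

```python
import joblib

# Persist the trained model, then reload it for inference on new claims.
joblib.dump(clf, "fraud_model.joblib")

loaded = joblib.load("fraud_model.joblib")
predictions = loaded.predict(X_test)  # replace X_test with new, unseen data
```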
Software & Tools: - Programming Language: Python (run in Google Colab) - Libraries: Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP. - Environment: Google Colab, Jupyter Notebook, or any Python IDE.
In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
This dataset consists of data from the 1985 Ward's Automotive Yearbook. Here are the sources:
1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037
Number of Instances: 398
Number of Attributes: 9, including the class attribute
Attribute Information:
mpg: continuous
cylinders: multi-valued discrete
displacement: continuous
horsepower: continuous
weight: continuous
acceleration: continuous
model year: multi-valued discrete
origin: multi-valued discrete
car name: string (unique for each instance)
This data set consists of three types of entities:
I - The specification of an auto in terms of various characteristics
II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky than that, the symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".
III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.
The analysis is divided into two parts (a short worked example follows the list below):
Data Wrangling
Exploratory Data Analysis
Descriptive statistics
Groupby
Analysis of variance
Correlation
Correlation stats
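A brief sketch of these steps on the imports-85 file is given below. The raw file has no header row, so column names follow the imports-85 documentation, and the specific grouping and ANOVA comparison are illustrative choices.

```python
import pandas as pd
from scipy import stats

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "autos/imports-85.data")

# Column names per the imports-85 documentation; "?" marks missing values.
cols = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
        "num_doors", "body_style", "drive_wheels", "engine_location",
        "wheel_base", "length", "width", "height", "curb_weight", "engine_type",
        "num_cylinders", "engine_size", "fuel_system", "bore", "stroke",
        "compression_ratio", "horsepower", "peak_rpm", "city_mpg",
        "highway_mpg", "price"]
df = pd.read_csv(url, names=cols, na_values="?")

# Descriptive statistics and groupby
print(df["price"].describe())
print(df.groupby("body_style")["price"].mean())

# One-way ANOVA: does mean price differ across drive-wheel types?
groups = [g["price"].dropna() for _, g in df.groupby("drive_wheels")]
f_stat, p_val = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4g}")

# Correlation between engine size and price
print(df[["engine_size", "price"]].corr())
```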
Acknowledgment: UCI Machine Learning Repository. Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Updated on 2021/07/01: added Python code for processing the NASS yield data, PRISM climate data, RMA insurance data, AgMIP yield simulation data, and the related geospatial analysis.

This repository holds the datasets and code for "Excessive rainfall leads to maize yield loss of a comparable magnitude to extreme drought in the United States" (Li et al. 2019 GCB). The datasets include maize yield, climate data, maize yield loss from crop insurance data in the US, etc. They can reproduce the results of the paper and can also be reused to explore other topics. Please cite Li et al. (2019) or acknowledge this dataset (DOI: 10.6084/m9.figshare.7581473) when using it.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset simulates the end-to-end operations of a California-based hospital for Q1 2025. It includes over 126,000 rows across 9 fully integrated tables that capture patient visits, clinical procedures, diagnoses, lab tests, medication prescriptions, provider details, billing, claims, and denials — designed for data analytics, machine learning, and healthcare research.
📦 Tables Included:

patients.csv – Patient demographics, insurance, DOB, gender
encounters.csv – Admission/discharge details, visit types, departments
diagnoses.csv – ICD-10 diagnosis codes linked to encounters
procedures.csv – CPT/ICD-10-PCS procedure codes per patient
medications.csv – Drug names, dosages, prescription data
lab_tests.csv – Test names, result values, normal ranges
claims_and_billing.csv – Financial charges, insurance claims, payments
providers.csv – Doctors, specializations, provider roles
denials.csv – Reasons for claim denial, status, appeal info
This dataset was custom-built to reflect real-world healthcare challenges including:
Messy and missing data (for cleaning exercises)
Insurance claim workflows and denial patterns
Analysis of repeat admissions and chronic disease trends
Medication brand usage, cost patterns, and outcomes
🧠 Ideal For:

Healthcare Data Science Projects
Revenue Cycle Management (RCM) analytics
Power BI & Tableau Dashboards
Machine Learning modeling (readmission, denial prediction, etc.)
Python/SQL Data Cleaning Practice
This dataset is completely synthetic and safe for public use. It was generated using custom rules, distributions, and logic reflective of real hospital operations.
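As one starting point, the sketch below flags 30-day readmissions from encounters.csv. The column names (patient_id, admission_date, discharge_date) are assumptions, since the table list above does not spell out the schema.

```python
import pandas as pd

# Assumed column names, for illustration only.
encounters = pd.read_csv("encounters.csv",
                         parse_dates=["admission_date", "discharge_date"])

# Flag 30-day readmissions: a new admission within 30 days of the previous discharge.
encounters = encounters.sort_values(["patient_id", "admission_date"])
encounters["prev_discharge"] = encounters.groupby("patient_id")["discharge_date"].shift(1)
encounters["readmitted_30d"] = (
    (encounters["admission_date"] - encounters["prev_discharge"]).dt.days <= 30
)

print(encounters["readmitted_30d"].value_counts(dropna=False))
```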