36 datasets found
  1. Data-preprocessing-Imputation

    • kaggle.com
    zip
    Updated May 18, 2019
    Cite
    Pankesh Patel (2019). Data-preprocessing-Imputation [Dataset]. https://www.kaggle.com/pankeshpatel/datapreprocessingimputation
    Available download formats: zip (327 bytes)
    Dataset updated
    May 18, 2019
    Authors
    Pankesh Patel
    Description

    This dataset was created by Pankesh Patel.

  2. Proteomics Data Preprocessing Simulation, KNN PCA

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). Proteomics Data Preprocessing Simulation, KNN PCA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/proteomics-data-preprocessing-simulation-knn-pca
    Available download formats: zip (24,051 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset provides a simulation of proteomics data preprocessing workflows.

    It focuses on the application of K-Nearest Neighbors (KNN) imputation to handle missing values.

    Principal Component Analysis (PCA) is applied for dimensionality reduction and visualization of high-dimensional proteomics data.

    The dataset demonstrates an end-to-end preprocessing pipeline for proteomics datasets.

    Includes synthetic or real-like proteomics data suitable for educational and research purposes.

    Designed to help researchers, bioinformaticians, and data scientists learn preprocessing techniques.

    Highlights the impact of missing data handling and normalization on downstream analysis.

    Aims to improve reproducibility of proteomics data analysis through a structured workflow.

    Useful for testing machine learning models on clean and preprocessed proteomics data.

    Supports hands-on learning for KNN imputation, PCA, and data visualization techniques.

    Helps users understand the significance of preprocessing in high-throughput biological data analysis.

    Provides code and explanations for a complete pipeline from raw data to PCA visualization.
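    The pipeline this dataset demonstrates (KNN imputation followed by PCA) can be sketched with scikit-learn; the matrix size, missingness rate, and choice of k below are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic proteomics-style matrix: 100 samples x 50 protein intensities,
# with ~10% of values knocked out at random.
X = rng.normal(loc=20.0, scale=3.0, size=(100, 50))
X[rng.random(X.shape) < 0.10] = np.nan

# KNN imputation: each missing value is filled from the 5 nearest samples
# (nan-aware Euclidean distance over the observed features).
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

# Standardize, then project onto two principal components for visualization.
X_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_imputed))
print(X_pca.shape)  # (100, 2)
```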

  3. Synthetic Stroke Prediction Dataset

    • kaggle.com
    • data.mendeley.com
    zip
    Updated May 2, 2025
    + more versions
    Cite
    Mohammed Borhan Uddin (2025). Synthetic Stroke Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/mohammedborhanuddin/synthetic-stroke-prediction-dataset
    Available download formats: zip (682,120 bytes)
    Dataset updated
    May 2, 2025
    Authors
    Mohammed Borhan Uddin
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.

    Cite the dataset using: Uddin, Mohammed Borhan (2025), “Synthetic Stroke Prediction Dataset”, Mendeley Data, V1, doi: 10.17632/s2nh6fm925.1

  4. Personal Information and Life Status Dataset

    • kaggle.com
    zip
    Updated Sep 24, 2025
    Cite
    Onur Kasap (2025). Personal Information and Life Status Dataset [Dataset]. https://www.kaggle.com/datasets/onurkasapdev/personal-information-and-life-status-dataset
    Available download formats: zip (3,276 bytes)
    Dataset updated
    Sep 24, 2025
    Authors
    Onur Kasap
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Personal Information and Life Status Dataset

    This is a synthetic dataset containing various personal and life status details of individuals, structured in a table with 100 different rows. The primary purpose of this dataset is to serve as a beginner-friendly resource for data science, machine learning, and data visualization projects. The data has been generated with a focus on consistency and realism, but it intentionally includes missing (None) and mistyped (typo) values in some features to highlight the importance of data preprocessing.

    Dataset Content

    The dataset consists of 14 columns, with each row representing an individual:

    FirstName: The individual's first name. (String)

    LastName: The individual's last name. (String)

    Age: The individual's age. Some values are missing. (Integer)

    Country: The individual's country of residence. Primarily includes developed countries and Türkiye. Some values may contain typos. (String)

    Marital: Marital status. (Married, Single, Divorced) (String)

    Education: Education level. Some values are missing. (High School, Bachelor's Degree, Master's Degree, PhD) (String)

    Wages: Annual gross wages. Some values are missing. (Integer)

    WorkHours: Weekly working hours. Some values are missing. (Integer)

    SmokeStatus: Smoking status. (Smoker, Non-smoker) (String)

    CarLicense: Possession of a driver's license. (Yes, No) (String)

    VeganStatus: Vegan status. Some values are missing. (Yes, No) (String)

    HolidayStatus: Holiday status. Some values are missing. (Yes, No) (String)

    SportStatus: Sports activity level. (Active, Inactive) (String)

    Score: A general life score for the individual. This is a synthetic value randomly assigned based on other features. Some values are missing. (Integer)

    Potential Use Cases

    This dataset is an ideal resource for various types of analysis, including but not limited to:

    Data Science and Machine Learning: Applying data preprocessing techniques such as imputation for missing values, outlier detection, and categorical encoding. Subsequently, you can build regression models to predict values like wages or score, or classification models to categorize individuals.

    Data Visualization: Creating interactive charts to show the relationship between education level and wages, the distribution of working hours by age, or the correlation between smoking status and overall life score.

    Exploratory Data Analysis (EDA): Exploring average wage differences across countries, sports habits based on marital status, or the link between education level and having a car license.
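    The preprocessing this dataset is designed to exercise (typo correction, imputation, categorical encoding) might be sketched in pandas as below; the rows and the typo mapping are invented for illustration and use only a subset of the 14 columns.

```python
import pandas as pd

# Toy rows mimicking a few of the dataset's columns (values are invented).
df = pd.DataFrame({
    "Age": [34.0, None, 51.0, 29.0],
    "Country": ["Germany", "Germny", "Türkiye", "Japan"],  # one deliberate typo
    "Education": ["PhD", None, "Bachelor's Degree", "Master's Degree"],
    "Wages": [72000.0, 58000.0, None, 64000.0],
})

# 1) Repair mistyped categories with an explicit mapping of known typos.
df["Country"] = df["Country"].replace({"Germny": "Germany"})

# 2) Impute: median for numeric columns, mode for categoricals.
for col in ["Age", "Wages"]:
    df[col] = df[col].fillna(df[col].median())
df["Education"] = df["Education"].fillna(df["Education"].mode()[0])

# 3) One-hot encode the cleaned categorical columns for modeling.
df = pd.get_dummies(df, columns=["Country", "Education"])
print(df.isna().sum().sum())  # 0
```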

    Acknowledgement

    We encourage you to share your work and findings after using this dataset. Your feedback is always welcome and will help us improve the quality of our datasets.

  5. Data from: GoiEner smart meters data

    • research.science.eus
    • observatorio-cientifico.ua.es
    • +1 more
    Updated 2022
    Cite
    Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris; Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris (2022). GoiEner smart meters data [Dataset]. https://research.science.eus/documentos/668fc48cb9e7c03b01be0b72
    Dataset updated
    2022
    Authors
    Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris; Granja, Carlos Quesada; Hernández, Cruz Enrique Borges; Astigarraga, Leire; Merveille, Chris
    Description

    Name: GoiEner smart meters data

    Summary: The dataset contains hourly time series of electricity consumption (kWh) provided by the Spanish electricity retailer GoiEner. The time series are arranged in four compressed files plus a metadata file:

    raw.tzst: raw time series of all GoiEner clients (any date, any length, may have missing samples).
    imp-pre.tzst: processed time series (imputation of missing samples), longer than one year, collected before March 1, 2020.
    imp-in.tzst: processed time series (imputation of missing samples), longer than one year, collected between March 1, 2020 and May 30, 2021.
    imp-post.tzst: processed time series (imputation of missing samples), longer than one year, collected after May 30, 2021.
    metadata.csv: relevant information for each time series.

    License: CC BY-SA

    Acknowledge: These data have been collected in the framework of the WHY project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 891943.

    Disclaimer: The sole responsibility for the content of this publication lies with the authors. It does not necessarily reflect the opinion of the Executive Agency for Small and Medium-sized Enterprises (EASME) or the European Commission (EC). EASME and the EC are not responsible for any use that may be made of the information contained therein.

    Collection Date: From November 2, 2014 to June 8, 2022.

    Publication Date: December 1, 2022.

    DOI: 10.5281/zenodo.7362094

    Other repositories: None.

    Author: GoiEner, University of Deusto.

    Objective of collection: This dataset was originally used to establish a methodology for clustering households according to their electricity consumption.

    Description: The meaning of each column is described next for each file. raw.tzst (no column names provided): timestamp; electricity consumption in kWh. imp-pre.tzst, imp-in.tzst, imp-post.tzst: “timestamp”: timestamp; “kWh”: electricity consumption in kWh; “imputed”: binary value indicating whether the row has been obtained by imputation. metadata.csv: “user”: 64-character hash identifying a user; “start_date”: initial timestamp of the time series; “end_date”: final timestamp of the time series; “length_days”: number of days elapsed between the initial and the final timestamps; “length_years”: number of years elapsed between the initial and the final timestamps; “potential_samples”: number of samples that would lie between the initial and the final timestamps if there were no missing values; “actual_samples”: number of actual samples of the time series; “missing_samples_abs”: number of potential samples minus actual samples; “missing_samples_pct”: potential samples minus actual samples as a percentage; “contract_start_date”: contract start date; “contract_end_date”: contract end date; “contracted_tariff”: type of tariff contracted (2.X: households and SMEs; 3.X: SMEs with high consumption; 6.X: industries, large commercial areas, and farms); “self_consumption_type”: the type of self-consumption to which the users are subscribed; “p1” through “p6”: contracted power (in kW) for each of the six time slots; “province”: province where the user is located; “municipality”: municipality where the user is located (municipalities below 50,000 inhabitants have been removed); “zip_code”: post code (post codes of municipalities below 50,000 inhabitants have been removed); “cnae”: CNAE (Clasificación Nacional de Actividades Económicas) code for economic activity classification.

    5 star: ⭐⭐⭐

    Preprocessing steps: Data cleaning (imputation of missing values using the Last Observation Carried Forward algorithm with weekly seasons); data integration (combination of multiple SIMEL files, i.e. the data sources); data transformation (anonymization, unit conversion, metadata generation).

    Reuse: This dataset is related to the datasets "A database of features extracted from different electricity load profiles datasets" (DOI 10.5281/zenodo.7382818), where time series feature extraction has been performed, and "Measuring the flexibility achieved by a change of tariff" (DOI 10.5281/zenodo.7382924), where the metadata has been extended to include the results of a socio-economic characterization and the answers to a survey about barriers to adapting to a change of tariff.

    Update policy: There might be a single update in mid-2023.

    Ethics and legal aspects: The data provided by GoiEner contained values of the CUPS (Meter Point Administration Number), which are personal data. A pre-processing step has been carried out to replace the CUPS with random 64-character hashes.

    Technical aspects: raw.tzst contains a 15.1 GB folder with 25,559 CSV files; imp-pre.tzst contains a 6.28 GB folder with 12,149 CSV files; imp-in.tzst contains a 4.36 GB folder with 15,562 CSV files; and imp-post.tzst contains a 4.01 GB folder with 17,519 CSV files.

    Other: None.
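    The "Last Observation Carried Forward using weekly seasons" cleaning step could be sketched in pandas as follows; the synthetic series and the fallback to plain forward fill are assumptions, not the authors' exact implementation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one hourly consumption series with a weekly
# pattern and a gap of missing samples.
idx = pd.date_range("2020-01-01", periods=24 * 28, freq="h")  # 4 weeks
kwh = pd.Series(np.tile(np.arange(24 * 7, dtype=float), 4), index=idx)
kwh.iloc[400:410] = np.nan

# Seasonal LOCF: fill a missing hour with the observation from the same
# hour one week (168 h) earlier; fall back to plain forward fill when
# the previous week is missing too.
filled = kwh.fillna(kwh.shift(24 * 7)).ffill()

# Flag the rows obtained by imputation (cf. the "imputed" column).
imputed = kwh.isna() & filled.notna()
print(int(imputed.sum()))  # 10
```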

  6. Data from: Time Series from Smart Meters

    • data.niaid.nih.gov
    • observatorio-cientifico.ua.es
    Updated Feb 22, 2025
    Cite
    Cruz Enrique Borges Hernandez; Carlos Quesada-Granja (2025). Time Series from Smart Meters [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4455197
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    Universidad de Deusto
    Authors
    Cruz Enrique Borges Hernandez; Carlos Quesada-Granja
    Description

    Note: this file contains technical problems that make it unsuitable for public use; please use the replacement dataset instead.

    Name: Time Series from Smart Meters

    Summary: The dataset contains: (1) raw and cleaned time series of smart meters from Spanish electric cooperatives, and (2) feature values and metadata extracted from residential load profiles, both from Spanish electric cooperatives and publicly available datasets.

    License: CC BY-NC-SA

    Acknowledge: These data have been collected in the framework of the WHY project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 891943.

    Disclaimer: The sole responsibility for the content of this publication lies with the authors. It does not necessarily reflect the opinion of the Executive Agency for Small and Medium-sized Enterprises (EASME) or the European Commission (EC). EASME or the EC are not responsible for any use that may be made of the information contained therein.

    Collection Date: The time series on Spanish electric cooperatives were collected between 2014 and 2021. Data on feature values and metadata were extracted between 2020 and 2021.

    Publication Date:

    DOI: 10.5281/zenodo.4455198

    Other repositories:

    Author: University of Deusto

    Objective of collection: This data was originally collected for invoicing electrical consumption. In this project it will be used to segment the households.

    Description: The dataset contains a CSV file with one entry for each load profile, both from the Spanish electric cooperatives and from publicly available datasets. The fields that can be found for each entry are (1) metadata such as the originating dataset to which they belong, the start and end dates, the number of days that have been imputed and/or extended, provenance (country, administrative division, municipality, zip code), origin identifier (households, businesses, industry, etc.), classification of socioeconomic conditions, and tariffs; and (2) extracted features, grouped by types such as statistical moments, quantiles, lag d-day autocorrelations, seasonal aggregates, peak and off-peak periods, load factors, energy consumed, features obtained using the R package "tsfeatures", and Catch-22 features. In addition, the dataset also contains an RData file for each entry of the Spanish electric cooperatives. These files contain the original time series (timestamped values), a field indicating whether the signal has been imputed and/or extended, and another field indicating the date of the extension.

    5 star: ⭐⭐⭐

    Preprocessing steps: anonymization, data fusion, imputation of gaps, extension of time series shorter than 800 days, computation of features.

    Reuse: NA

    Update policy: The data will be updated throughout 2021.

    Ethics and legal aspects: Spanish electric cooperative data contains the CUPS (Meter Point Administration Number), which is personal data. A pre-processing step has been carried out to substitute the CUPS by a MD5 hash.

    Technical aspects: Decompressed data is quite large (big data).

    Other:

  7. Data from: Quantifying the Benefits of Imputation over QSAR Methods in...

    • acs.figshare.com
    zip
    Updated Dec 13, 2023
    Cite
    Thomas M. Whitehead; Joel Strickland; Gareth J. Conduit; Alexandre Borrel; Daniel Mucs; Irene Baskerville-Abraham (2023). Quantifying the Benefits of Imputation over QSAR Methods in Toxicology Data Modeling [Dataset]. http://doi.org/10.1021/acs.jcim.3c01695.s002
    Available download formats: zip
    Dataset updated
    Dec 13, 2023
    Dataset provided by
    ACS Publications
    Authors
    Thomas M. Whitehead; Joel Strickland; Gareth J. Conduit; Alexandre Borrel; Daniel Mucs; Irene Baskerville-Abraham
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Imputation machine learning (ML) surpasses traditional approaches in modeling toxicity data. The method was tested on an open-source data set comprising approximately 2500 ingredients with limited in vitro and in vivo data obtained from the OECD QSAR Toolbox. By leveraging the relationships between different toxicological end points, imputation extracts more valuable information from each data point compared to well-established single end point methods, such as ML-based Quantitative Structure Activity Relationship (QSAR) approaches, providing a final improvement of up to around 0.2 in the coefficient of determination. A significant aspect of this methodology is its resilience to the inclusion of extraneous chemical or experimental data. While additional data typically introduces a considerable level of noise and can hinder performance of single end point QSAR modeling, imputation models remain unaffected. This implies a reduction in the need for laborious manual preprocessing tasks such as feature selection, thereby making data preparation for ML analysis more efficient. This successful test, conducted on open-source data, validates the efficacy of imputation approaches in toxicity data analysis. This work opens the way for applying similar methods to other types of sparse toxicological data matrices, and so we discuss the development of regulatory authority guidelines to accept imputation models, a key aspect for the wider adoption of these methods.

  8. Supplementary file 1_weIMPUTE: a user-friendly web-based genotype imputation...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Mar 17, 2025
    Cite
    Yu, Helong; Lin, Jian; Hu, Xiaodong; Ye, Guanshi; Yu, Jun; Liu, Defu; Li, Mingliang; Li, Zhuo; Tang, You; Li, Qi; Bi, Chunguang (2025). Supplementary file 1_weIMPUTE: a user-friendly web-based genotype imputation platform.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002100841
    Dataset updated
    Mar 17, 2025
    Authors
    Yu, Helong; Lin, Jian; Hu, Xiaodong; Ye, Guanshi; Yu, Jun; Liu, Defu; Li, Mingliang; Li, Zhuo; Tang, You; Li, Qi; Bi, Chunguang
    Description

    Background: Genotype imputation is a critical preprocessing step in genome-wide association studies (GWAS), enhancing statistical power for detecting associated single nucleotide polymorphisms (SNPs) by increasing marker size.

    Results: In response to the needs of researchers seeking user-friendly graphical tools for imputation without requiring informatics or computer expertise, we have developed weIMPUTE, a web-based imputation graphical user interface (GUI). Unlike existing genotype imputation software, weIMPUTE supports multiple imputation tools, including SHAPEIT, Eagle, Minimac4, Beagle, and IMPUTE2, while encompassing the entire workflow, from quality control to data format conversion. This comprehensive platform enables both novices and experienced users to readily perform imputation tasks. For reference genotype data owners, weIMPUTE can be installed on a server or workstation, facilitating web-based imputation services without data sharing.

    Conclusion: weIMPUTE represents a versatile imputation solution for researchers across various fields, offering the flexibility to create personalized imputation servers on different operating systems.

  9. The selected explanatory variables.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    + more versions
    Cite
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada (2023). The selected explanatory variables. [Dataset]. http://doi.org/10.1371/journal.pone.0281901.t002
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. 
This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.
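    Of the steps above, the Random Forest importance-based dimensionality reduction is straightforward to sketch with scikit-learn (the SMOTE oversampling and tree-based imputation steps are omitted here; the synthetic data and threshold are illustrative, not the paper's setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Imbalanced, high-dimensional stand-in for the crash-severity data.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=8,
                           weights=[0.9, 0.1], random_state=42)

# Keep only features whose RF importance exceeds the mean importance.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
selector = SelectFromModel(rf, threshold="mean").fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape[1], "of", X.shape[1], "features kept")
```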

  10. Real Estate Price Prediction Data

    • figshare.com
    txt
    Updated Aug 8, 2024
    Cite
    Mohammad Shbool; Rand Al-Dmour; Bashar Al-Shboul; Nibal Albashabsheh; Najat Almasarwah (2024). Real Estate Price Prediction Data [Dataset]. http://doi.org/10.6084/m9.figshare.26517325.v1
    Available download formats: txt
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mohammad Shbool; Rand Al-Dmour; Bashar Al-Shboul; Nibal Albashabsheh; Najat Almasarwah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This dataset was collected and curated to support research on predicting real estate prices using machine learning algorithms, specifically Support Vector Regression (SVR) and Gradient Boosting Machine (GBM). The dataset includes comprehensive information on residential properties, enabling the development and evaluation of predictive models for accurate and transparent real estate appraisals.

    Data Source: The data was sourced from Department of Lands and Survey real estate listings.

    Features: The dataset contains the following key attributes for each property:

    Area (in square meters): The total living area of the property.
    Floor Number: The floor on which the property is located.
    Location: Geographic coordinates or city/region where the property is situated.
    Type of Apartment: The classification of the property, such as studio, one-bedroom, two-bedroom, etc.
    Number of Bathrooms: The total number of bathrooms in the property.
    Number of Bedrooms: The total number of bedrooms in the property.
    Property Age (in years): The number of years since the property was constructed.
    Property Condition: A categorical variable indicating the condition of the property (e.g., new, good, fair, needs renovation).
    Proximity to Amenities: The distance to nearby amenities such as schools, hospitals, shopping centers, and public transportation.
    Market Price (target variable): The actual sale price or listed price of the property.

    Data Preprocessing:

    Normalization: Numeric features such as area and proximity to amenities were normalized to ensure consistency and improve model performance.
    Categorical Encoding: Categorical features like property condition and type of apartment were encoded using one-hot encoding or label encoding, depending on the specific model requirements.
    Missing Values: Missing data points were handled using appropriate imputation techniques or by excluding records with significant missing information.

    Usage: This dataset was utilized to train and test machine learning models, aiming to predict the market price of residential properties based on the provided attributes. The models developed using this dataset demonstrated improved accuracy and transparency over traditional appraisal methods.

    Dataset Availability: The dataset is available for public use under CC BY 4.0. Users are encouraged to cite the related publication when using the data in their research or applications.

    Citation: If you use this dataset in your research, please cite the following publication: [Real Estate Decision-Making: Precision in Price Prediction through Advanced Machine Learning Algorithms].
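    The normalization and encoding steps described under Data Preprocessing can be sketched as follows; the column names and rows are invented stand-ins for the attributes above, and MinMaxScaler with OneHotEncoder is one plausible choice among several:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Illustrative rows echoing a few of the listed attributes.
df = pd.DataFrame({
    "area_m2": [120.0, 85.0, 200.0],
    "proximity_km": [0.5, 2.0, 5.0],
    "condition": ["new", "good", "needs renovation"],
    "apartment_type": ["two-bedroom", "studio", "one-bedroom"],
})

# Normalize numeric features; one-hot encode categorical ones.
pre = ColumnTransformer([
    ("num", MinMaxScaler(), ["area_m2", "proximity_km"]),
    ("cat", OneHotEncoder(), ["condition", "apartment_type"]),
])
X = pre.fit_transform(df)
print(X.shape)  # (3, 8): 2 scaled numerics + 3 + 3 one-hot columns
```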

  11. Detailed overview of feature information.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Sep 4, 2024
    Cite
    Pishgar, Maryam; Li, Hexin; Chen, Yubing; Ashrafi, Negin; Zhao, Guanlan; Kang, Chris (2024). Detailed overview of feature information. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001305744
    Dataset updated
    Sep 4, 2024
    Authors
    Pishgar, Maryam; Li, Hexin; Chen, Yubing; Ashrafi, Negin; Zhao, Guanlan; Kang, Chris
    Description

    Background: Mechanical ventilation (MV) is vital for critically ill ICU patients but carries significant mortality risks. This study aims to develop a predictive model to estimate hospital mortality among MV patients, utilizing comprehensive health data to assist ICU physicians with early-stage alerts.

    Methods: We developed a Machine Learning (ML) framework to predict hospital mortality in ICU patients receiving MV. Using the MIMIC-III database, we identified 25,202 eligible patients through ICD-9 codes. We employed backward elimination and the Lasso method, selecting 32 features based on clinical insights and literature. Data preprocessing included eliminating columns with over 90% missing data and using mean imputation for the remaining missing values. To address class imbalance, we used the Synthetic Minority Over-sampling Technique (SMOTE). We evaluated several ML models, including CatBoost, XGBoost, Decision Tree, Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Logistic Regression, using a 70/30 train-test split. The CatBoost model was chosen for its superior performance in terms of accuracy, precision, recall, F1-score, AUROC metrics, and calibration plots.

    Results: The study involved a cohort of 25,202 patients on MV. The CatBoost model attained an AUROC of 0.862, an increase from an initial AUROC of 0.821, which was the best reported in the literature. It also demonstrated an accuracy of 0.789, an F1-score of 0.747, and better calibration, outperforming other models. These improvements are due to systematic feature selection and the robust gradient boosting architecture of CatBoost.

    Conclusion: The preprocessing methodology significantly reduced the number of relevant features, simplifying computational processes, and identified critical features previously overlooked. Integrating these features and tuning the parameters, our model demonstrated strong generalization to unseen data. This highlights the potential of ML as a crucial tool in ICUs, enhancing resource allocation and providing more personalized interventions for MV patients.
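    The two preprocessing rules stated in the Methods (drop columns with over 90% missing data, then mean-impute the rest) reduce to a few lines of pandas; the column names and missingness pattern below are illustrative, not taken from MIMIC-III:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["heart_rate", "spo2", "lactate", "rare_lab"])
df.iloc[:95, df.columns.get_loc("rare_lab")] = np.nan  # 95% missing
df.iloc[:20, df.columns.get_loc("lactate")] = np.nan   # 20% missing

# Rule 1: eliminate columns with over 90% missing data.
df = df.loc[:, df.isna().mean() <= 0.90]
# Rule 2: mean imputation for the remaining missing values.
df = df.fillna(df.mean())
print(list(df.columns))  # ['heart_rate', 'spo2', 'lactate']
```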

  12. Handling Missing Data Example Dataset

    • kaggle.com
    zip
    Updated Aug 21, 2025
    Cite
    PRINCE1204 (2025). Handling Missing Data Example Dataset [Dataset]. https://www.kaggle.com/prince1204/handling-missing-data-example-dataset
    Available download formats: zip (10,211 bytes)
    Dataset updated
    Aug 21, 2025
    Authors
    PRINCE1204
    Description

    📊 Dataset Description – Handling Missing Data

    This dataset contains 1,000 employee records across different departments and cities, designed for practicing data cleaning, preprocessing, and handling missing values in real-world scenarios.

    🔹 Features (Columns)

    • ID (Integer): Unique identifier for each employee.
    • Age (Float): Age of the employee (some values are missing).
    • Salary (Float): Annual salary of the employee in USD (some values are missing).
    • Experience (Float): Total years of professional experience (some values are missing).
    • Department (Categorical): Department of the employee (e.g., IT, Sales, Finance, Admin) – contains missing values.
    • City (Categorical): Work location of the employee (e.g., London, Berlin, New York) – contains missing values.

    🔹 Missing Data Information

    • Columns Age, Salary, Experience, Department, and City contain around 100 missing values each.
    • The dataset is ideal for testing different missing data handling techniques, such as:
      • Mean / Median / Mode imputation
      • Random sampling imputation
      • Forward / Backward filling
      • Predictive modeling approaches
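    The techniques listed above can be tried on a hypothetical slice of such an employee table (column names follow the description; the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical mini version of the employee table described above.
df = pd.DataFrame({
    "Age":        [25.0, np.nan, 40.0, 31.0, np.nan],
    "Salary":     [50000.0, 62000.0, np.nan, 58000.0, 61000.0],
    "Department": ["IT", "Sales", None, "IT", "Finance"],
})

# Mean / median imputation for numeric columns.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

# Mode imputation for a categorical column.
df["Department"] = df["Department"].fillna(df["Department"].mode()[0])

# Forward / backward filling (shown on a fresh series for clarity).
s = pd.Series([1.0, np.nan, np.nan, 4.0])
forward = s.ffill()
backward = s.bfill()
```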

    🔹 Use Cases

    • 🧹 Practice data cleaning & preprocessing for ML projects.
    • 🔧 Explore imputation techniques for both numerical and categorical data.
    • 🤖 Build predictive models while handling incomplete datasets.
    • 🎓 Great for educational purposes, tutorials, and workshops on missing data handling.
  13. DataSheet1_Machine learning-based prediction model for the efficacy and...

    • frontiersin.figshare.com
    docx
    Updated Jul 29, 2024
    Cite
    Yu Xiong; Xiaoyang Liu; Qing Wang; Li Zhao; Xudong Kong; Chunhe Da; Zuohuan Meng; Leilei Qu; Qinfang Xia; Lihong Liu; Pengmei Li (2024). DataSheet1_Machine learning-based prediction model for the efficacy and safety of statins.docx [Dataset]. http://doi.org/10.3389/fphar.2024.1334929.s001
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    Frontiers Mediahttp://www.frontiersin.org/
    Authors
    Yu Xiong; Xiaoyang Liu; Qing Wang; Li Zhao; Xudong Kong; Chunhe Da; Zuohuan Meng; Leilei Qu; Qinfang Xia; Lihong Liu; Pengmei Li
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: The appropriate use of statins plays a vital role in reducing the risk of atherosclerotic cardiovascular disease (ASCVD). However, due to changes in diet and lifestyle, there has been a significant increase in the number of individuals with high cholesterol levels. Therefore, it is crucial to ensure the rational use of statins. Adverse reactions associated with statins, including liver enzyme abnormalities and statin-associated muscle symptoms (SAMS), have impacted their widespread utilization. In this study, we aimed to develop a predictive model for statin efficacy and safety based on real-world clinical data using machine learning techniques.

    Methods: We employed various data preprocessing techniques, such as improved random forest imputation and Borderline SMOTE oversampling, to handle the dataset. The Boruta method was utilized for feature selection, and the dataset was divided into training and testing sets in a 7:3 ratio. Five algorithms, including logistic regression, naive Bayes, decision tree, random forest, and gradient boosting decision tree, were used to construct the predictive models. Ten-fold cross-validation and bootstrapping sampling were performed for internal and external validation. Additionally, SHAP (SHapley Additive exPlanations) was employed for feature interpretability. Ultimately, an accessible web-based platform for predicting statin efficacy and safety was established based on the optimal predictive model.

    Results: The random forest algorithm exhibited the best performance among the five algorithms. The predictive models for LDL-C target attainment (AUC = 0.883, Accuracy = 0.868, Precision = 0.858, Recall = 0.863, F1 = 0.860, AUPRC = 0.906, MCC = 0.761), liver enzyme abnormalities (AUC = 0.964, Accuracy = 0.964, Precision = 0.967, Recall = 0.963, F1 = 0.965, AUPRC = 0.978, MCC = 0.938), and muscle pain/Creatine kinase (CK) abnormalities (AUC = 0.981, Accuracy = 0.980, Precision = 0.987, Recall = 0.975, F1 = 0.981, AUPRC = 0.987, MCC = 0.965) demonstrated favorable performance. The most important features of the LDL-C target attainment prediction model were cerebral infarction, TG, PLT, and HDL. The most important features of the liver enzyme abnormalities model were CRP, CK, and number of oral medications. Similarly, AST, ALT, PLT, and number of oral medications were found to be important features for muscle pain/CK abnormalities. Based on the best-performing predictive model, a user-friendly web application was designed and implemented.

    Conclusion: This study presented a machine learning-based predictive model for statin efficacy and safety. The platform developed can assist in guiding statin therapy decisions and optimizing treatment strategies. Further research and application of the model are warranted to improve the utilization of statin therapy.
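    The study pairs random-forest-based imputation with a 7:3 split. One common approximation of that imputation step, not necessarily the authors' exact method, is scikit-learn's IterativeImputer driven by a RandomForestRegressor, shown here on synthetic data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical feature matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
X[rng.choice(150, 20, replace=False), 2] = np.nan  # inject missing values
y = (X[:, 0] + rng.normal(size=150) > 0).astype(int)

# Random-forest-based iterative imputation (an approximation of the
# study's "improved random forest imputation"; details may differ).
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = imputer.fit_transform(X)

# 7:3 train/test split, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_filled, y, test_size=0.3, random_state=0
)
```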

  14. Retail Product Dataset with Missing Values

    • kaggle.com
    zip
    Updated Feb 17, 2025
    Cite
    Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
    Explore at:
    zip(47826 bytes)Available download formats
    Dataset updated
    Feb 17, 2025
    Authors
    Himel Sarder
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

    The dataset includes:
    - Category (Categorical): Product category (A, B, C, D)
    - Price (Numerical): Randomized product prices
    - Rating (Numerical): Ratings between 1 to 5
    - Stock (Categorical): Availability status (In Stock, Out of Stock)
    - Discount (Numerical): Discount percentage

    This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
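    With structured missingness like this, a useful first step is a per-column missingness audit before choosing an imputation strategy. The frame below is a small synthetic stand-in with roughly the stated rates for three of the columns (the injection pattern is invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with roughly the missingness pattern described.
rng = np.random.default_rng(7)
n = 1000
df = pd.DataFrame({
    "Category": rng.choice(list("ABCD"), n),
    "Price": rng.uniform(5, 500, n).round(2),
    "Rating": rng.uniform(1, 5, n).round(1),
})
for col, frac in [("Category", 0.63), ("Price", 0.04), ("Rating", 0.47)]:
    df.loc[rng.choice(n, int(n * frac), replace=False), col] = np.nan

# Audit missingness per column; high-missingness columns may need
# dropping or model-based imputation rather than a simple fill.
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct)
```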

  15. Data from: Prediction of Mild Cognitive Impairment in Older Adults

    • produccioncientifica.ucm.es
    Updated 2025
    Cite
    Fernández-Blázquez, Miguel Ángel; Ruiz-Sánchez de León, José María; Gómez-Ramírez, Jaime (2025). Prediction of Mild Cognitive Impairment in Older Adults [Dataset]. https://produccioncientifica.ucm.es/documentos/67e115d47bcb023de59b2115
    Explore at:
    Dataset updated
    2025
    Authors
    Fernández-Blázquez, Miguel Ángel; Ruiz-Sánchez de León, José María; Gómez-Ramírez, Jaime
    Description

    This dataset consists of 845 cognitively healthy Spanish individuals, aged 65 to 87 years at baseline, all living independently at home and free from significant psychiatric, neurological, or systemic disorders. The data were collected with the aim of improving the early detection and prevention of mild cognitive impairment (MCI) and dementia. To achieve this, all participants underwent a comprehensive assessment protocol, typically completed within four hours, with appropriate breaks provided.

    The full assessment included a semi-structured clinical interview, as well as neurological and neuropsychological evaluations. Consequently, the dataset contains information on 219 variables, organized into four categories: Sociodemographic, Self-reported, Medical Examination, and Cognitive Assessment.

    This dataset was utilized to develop four eXtreme Gradient Boosting (XGBoost) models of increasing complexity. The models were trained and evaluated using robust preprocessing techniques, including multiple imputation for handling missing data and the Synthetic Minority Oversampling Technique (SMOTE) for class balancing. Three versions of the dataset are provided here: the original dataset, the dataset with multiple imputations, and the balanced dataset.

  16. Feature Importance Based on MI Scores.

    • figshare.com
    xls
    Updated Sep 30, 2025
    Cite
    Rahman Farnoosh; Karlo Abnoosian; Rasha Abbas Isewid; Danial Javaheri (2025). Feature Importance Based on MI Scores. [Dataset]. http://doi.org/10.1371/journal.pone.0330454.t008
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rahman Farnoosh; Karlo Abnoosian; Rasha Abbas Isewid; Danial Javaheri
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Type 2 diabetes mellitus remains a critical global health challenge, with rising incidence rates placing immense pressure on healthcare systems worldwide. This chronic metabolic disorder affects diverse populations, including the elderly and children, leading to severe complications. Early and accurate prediction is essential to mitigate these consequences, yet traditional models often struggle with challenges such as imbalanced datasets, high-dimensional data, missing values, and outliers, resulting in limited predictive performance and interpretability. This study introduces DiabetesXpertNet, an innovative deep learning framework designed to enhance the prediction of Type 2 diabetes mellitus. Unlike existing convolutional neural network models optimized for image data, which focus on generalized attention mechanisms, DiabetesXpertNet is specifically tailored for tabular medical data. It incorporates a convolutional neural network architecture with dynamic channel attention modules to prioritize clinically significant features, such as glucose and insulin levels, and a context-aware feature enhancer to capture complex sequential relationships within structured datasets. The model employs advanced preprocessing techniques, including mean imputation for missing values, median replacement for outliers, and feature selection through mutual information and LASSO regression, to improve dataset quality and computational efficiency. Additionally, a logistic regression-based class weighting strategy addresses class imbalance, enhancing model fairness. Evaluated on the PID dataset and Frankfurt Hospital, Germany Diabetes datasets, DiabetesXpertNet achieves an accuracy of 89.98%, AUC of 91.95%, precision of 89.08%, recall of 88.11%, and F1-score of 88.01%, outperforming existing machine learning and deep learning models. 
Compared to traditional machine learning approaches, it demonstrates significant improvements in precision (+5.1%), recall (+4.8%), F1-score (+5.1%), accuracy (+6.0%), and AUC (+4.5%). Against other convolutional neural network models, it shows meaningful gains in precision (+2.2%), recall (+1.1%), F1-score (+1.2%), accuracy (+1.9%), and AUC (+0.6%). These results underscore the robustness and interpretability of DiabetesXpertNet, making it a promising tool for early Type 2 diabetes diagnosis in clinical settings.
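    The two-stage feature selection described above (mutual information ranking followed by LASSO) can be sketched on synthetic data; the top-k cutoff and the alpha below are illustrative choices, not values from the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import Lasso

# Synthetic tabular data standing in for the diabetes features.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

# Step 1: rank features by mutual information with the label.
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:6]   # keep the 6 highest-MI features

# Step 2: LASSO on the reduced set; features with nonzero
# coefficients survive to the final model.
lasso = Lasso(alpha=0.01).fit(X[:, top], y)
selected = top[lasso.coef_ != 0]
print("selected feature indices:", selected)
```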

  17. 📊 Telco Customer Churn Dataset

    • kaggle.com
    zip
    Updated Jul 18, 2025
    Cite
    Austin Kleon (2025). 📊 Telco Customer Churn Dataset [Dataset]. https://www.kaggle.com/datasets/jethwaaatmik/telco-customer-churn-dataset
    Explore at:
    zip(172687 bytes)Available download formats
    Dataset updated
    Jul 18, 2025
    Authors
    Austin Kleon
    Description

    📝 Dataset Description

    This dataset contains information about customers of a telecommunications company, including their demographic details, account information, service subscriptions, and churn status. It is a modified version of the popular Telco Churn dataset, curated for exploratory data analysis, machine learning model development, and churn prediction tasks.

    The dataset includes simulated missing values in some columns to reflect real-world data issues and support preprocessing and imputation tasks. This makes it especially useful for demonstrating data cleaning techniques and evaluating model robustness.

    📂 Files Included

    telco_data_modified.csv: The main dataset with 21 columns and 7043 rows (some missing values are intentionally inserted).

    📌 Features

    • customerID: Unique identifier for each customer
    • gender: Customer gender (Male/Female)
    • SeniorCitizen: Indicates if the customer is a senior citizen (0 = No, 1 = Yes)
    • Partner: Whether the customer has a partner
    • Dependents: Whether the customer has dependents
    • tenure: Number of months the customer has stayed with the company
    • PhoneService: Whether the customer has phone service
    • MultipleLines: Whether the customer has multiple lines
    • InternetService: Customer's internet service provider (DSL, Fiber optic, No)
    • OnlineSecurity: Whether the customer has online security
    • OnlineBackup: Whether the customer has online backup
    • DeviceProtection: Whether the customer has device protection
    • TechSupport: Whether the customer has tech support
    • StreamingTV: Whether the customer has streaming TV
    • StreamingMovies: Whether the customer has streaming movies
    • Contract: Type of contract (Month-to-month, One year, Two year)
    • PaperlessBilling: Whether the customer uses paperless billing
    • PaymentMethod: Payment method (e.g., Electronic check, Mailed check, etc.)
    • MonthlyCharges: Monthly charges
    • TotalCharges: Total charges to date
    • Churn: Whether the customer has left the company (Yes/No)

    🔍 Use Cases

    Binary classification: Predict customer churn

    Data preprocessing and imputation exercises

    Feature engineering and importance analysis

    Customer segmentation and churn modeling

    ⚠️ Notes

    Missing values were intentionally inserted in the dataset to help simulate real-world conditions.

    Some preprocessing may be required before modeling (e.g., converting categorical to numerical data, handling TotalCharges as numeric).
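    The TotalCharges note above is a classic coercion task: the column typically arrives as strings with blank entries, which pd.to_numeric turns into NaN so they can be imputed. A minimal sketch with made-up values:

```python
import pandas as pd

# TotalCharges often arrives as strings with blanks; coerce to numeric.
df = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Blanks became NaN; impute (median here) before modeling.
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())
```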

    🏷️ Tags

    #telecom #churn #classification #customer-analytics #data-cleaning #feature-engineering

    🙏 Acknowledgements

    This dataset is based on the original Telco Customer Churn dataset (initially provided by IBM). The current version has been modified for academic and practical exercises.

  18. Extrovert vs. Introvert Behavior Data

    • kaggle.com
    zip
    Updated Jun 13, 2025
    Cite
    Rakesh Kapilavayi (2025). Extrovert vs. Introvert Behavior Data [Dataset]. https://www.kaggle.com/datasets/rakeshkapilavai/extrovert-vs-introvert-behavior-data/discussion
    Explore at:
    zip(31277 bytes)Available download formats
    Dataset updated
    Jun 13, 2025
    Authors
    Rakesh Kapilavayi
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Overview

    Dive into the Extrovert vs. Introvert Personality Traits Dataset, a rich collection of behavioral and social data designed to explore the spectrum of human personality. This dataset captures key indicators of extroversion and introversion, making it a valuable resource for psychologists, data scientists, and researchers studying social behavior, personality prediction, or data preprocessing techniques.

    Context

    Personality traits like extroversion and introversion shape how individuals interact with their social environments. This dataset provides insights into behaviors such as time spent alone, social event attendance, and social media engagement, enabling applications in psychology, sociology, marketing, and machine learning. Whether you're predicting personality types or analyzing social patterns, this dataset is your gateway to uncovering fascinating insights.

    Dataset Details

    Size: The dataset contains 2,900 rows and 8 columns.

    Features:

      - Time_spent_Alone: Hours spent alone daily (0–11).
      - Stage_fear: Presence of stage fright (Yes/No).
      - Social_event_attendance: Frequency of social events (0–10).
      - Going_outside: Frequency of going outside (0–7).
      - Drained_after_socializing: Feeling drained after socializing (Yes/No).
      - Friends_circle_size: Number of close friends (0–15).
      - Post_frequency: Social media post frequency (0–10).
      - Personality: Target variable (Extrovert/Introvert).
    

    Data Quality: Includes some missing values, ideal for practicing imputation and preprocessing.

    Format: Single CSV file, compatible with Python, R, and other tools.

    Data Quality Notes

    • Contains missing values in columns like Time_spent_Alone and Going_outside, offering opportunities for data cleaning practice.
    • Balanced classes ensure robust model training.
    • Binary categorical variables simplify encoding tasks.
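    The imputation and encoding exercises suggested above might look like this on a hypothetical slice of the table (the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the personality table.
df = pd.DataFrame({
    "Time_spent_Alone": [4.0, np.nan, 9.0, 1.0],
    "Stage_fear": ["No", "Yes", "Yes", "No"],
    "Personality": ["Extrovert", "Introvert", "Introvert", "Extrovert"],
})

# Impute the numeric gap, then encode the binary columns.
df["Time_spent_Alone"] = df["Time_spent_Alone"].fillna(
    df["Time_spent_Alone"].median()
)
df["Stage_fear"] = df["Stage_fear"].map({"No": 0, "Yes": 1})
df["Personality"] = (df["Personality"] == "Extrovert").astype(int)
```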

    Potential Use Cases

    • Build machine learning models to predict personality types.
    • Analyze correlations between social behaviors and personality traits.
    • Explore social media engagement patterns.
    • Practice data preprocessing techniques like imputation and encoding.
    • Create visualizations to uncover behavioral trends.
  19. Final Features Selected by LassoR.

    • figshare.com
    xls
    Updated Sep 30, 2025
    Cite
    Rahman Farnoosh; Karlo Abnoosian; Rasha Abbas Isewid; Danial Javaheri (2025). Final Features Selected by LassoR. [Dataset]. http://doi.org/10.1371/journal.pone.0330454.t009
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Rahman Farnoosh; Karlo Abnoosian; Rasha Abbas Isewid; Danial Javaheri
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Type 2 diabetes mellitus remains a critical global health challenge, with rising incidence rates placing immense pressure on healthcare systems worldwide. This chronic metabolic disorder affects diverse populations, including the elderly and children, leading to severe complications. Early and accurate prediction is essential to mitigate these consequences, yet traditional models often struggle with challenges such as imbalanced datasets, high-dimensional data, missing values, and outliers, resulting in limited predictive performance and interpretability. This study introduces DiabetesXpertNet, an innovative deep learning framework designed to enhance the prediction of Type 2 diabetes mellitus. Unlike existing convolutional neural network models optimized for image data, which focus on generalized attention mechanisms, DiabetesXpertNet is specifically tailored for tabular medical data. It incorporates a convolutional neural network architecture with dynamic channel attention modules to prioritize clinically significant features, such as glucose and insulin levels, and a context-aware feature enhancer to capture complex sequential relationships within structured datasets. The model employs advanced preprocessing techniques, including mean imputation for missing values, median replacement for outliers, and feature selection through mutual information and LASSO regression, to improve dataset quality and computational efficiency. Additionally, a logistic regression-based class weighting strategy addresses class imbalance, enhancing model fairness. Evaluated on the PID dataset and Frankfurt Hospital, Germany Diabetes datasets, DiabetesXpertNet achieves an accuracy of 89.98%, AUC of 91.95%, precision of 89.08%, recall of 88.11%, and F1-score of 88.01%, outperforming existing machine learning and deep learning models. 
Compared to traditional machine learning approaches, it demonstrates significant improvements in precision (+5.1%), recall (+4.8%), F1-score (+5.1%), accuracy (+6.0%), and AUC (+4.5%). Against other convolutional neural network models, it shows meaningful gains in precision (+2.2%), recall (+1.1%), F1-score (+1.2%), accuracy (+1.9%), and AUC (+0.6%). These results underscore the robustness and interpretability of DiabetesXpertNet, making it a promising tool for early Type 2 diabetes diagnosis in clinical settings.

  20. LIFE EXPECTANCY

    • kaggle.com
    zip
    Updated Oct 21, 2024
    Cite
    DavidGatt (2024). LIFE EXPECTANCY [Dataset]. https://www.kaggle.com/datasets/davidgatt222/life-expectansy-dataset
    Explore at:
    zip(1089650 bytes)Available download formats
    Dataset updated
    Oct 21, 2024
    Authors
    DavidGatt
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Overview
    This project analyzes life expectancy across countries, utilizing data from 2000 to 2015. The study examines how key socioeconomic and health factors influence life expectancy. Factors such as GDP, adult mortality, schooling, HIV/AIDS prevalence, and BMI are included in the analysis, which uses multiple linear regression and mixed-effects modeling to determine which variables significantly affect life expectancy.

    Data Description
    The dataset includes life expectancy information and its influencing factors from various countries over a 15-year period (2000-2015). The data was sourced from the WHO Life Expectancy Dataset available on Kaggle. It comprises both continuous and categorical variables, including:

    • Life Expectancy (Dependent Variable): Average number of years an individual is expected to live.
    • Continuous Variables:
      • GDP per capita
      • Adult Mortality (per 1000 individuals aged 15-65)
      • Schooling (mean years of education)
      • Alcohol consumption per capita
    • Categorical Variables:
      • HIV/AIDS prevalence
      • Country status (Developed vs. Developing)
      • BMI category (Underweight, Normal, Overweight, Obese)

    Problem Statement
    Life expectancy is a crucial metric for assessing the overall health and well-being of populations. It varies significantly between countries due to economic, social, and health factors. This project seeks to identify the most important variables that predict life expectancy, offering insights for policymakers on improving public health and longevity in their populations.

    Hypotheses
    1. Higher GDP leads to higher life expectancy.
    2. Higher adult mortality results in lower life expectancy.
    3. More years of schooling increase life expectancy.
    4. Higher HIV/AIDS prevalence reduces life expectancy.
    5. Living in a developed country increases life expectancy.
    6. Higher BMI (underweight or obese) correlates with reduced life expectancy.
    7. Higher alcohol consumption reduces life expectancy.

    Methodology
    • Data Preprocessing: Missing values were handled by imputation, and skewed variables (like GDP) were log-transformed to improve model performance.
    • Exploratory Data Analysis: Visualizations (histograms, scatterplots, and box plots) were used to understand the relationships between independent variables and life expectancy.
    • Modeling:
      • Multiple Linear Regression was used to examine how each continuous and categorical variable impacts life expectancy.
      • Mixed-effects modeling was applied to account for country-specific effects, capturing variability across different nations.
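    The log-transform-then-regress step in the methodology can be sketched on synthetic right-skewed data; the coefficients and noise levels below are invented for illustration, not estimates from the study:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: right-skewed GDP and a life-expectancy response.
rng = np.random.default_rng(3)
gdp = rng.lognormal(mean=9.0, sigma=1.0, size=200)
life_exp = 50 + 2.0 * np.log(gdp) + rng.normal(0, 2, 200)

# Log-transform the skewed predictor before fitting, as described above.
X = np.log(gdp).reshape(-1, 1)
model = LinearRegression().fit(X, life_exp)
r2 = model.score(X, life_exp)
```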

    Key Results
    1. GDP: Log-transformed GDP had a significant positive effect on life expectancy, with an adjusted R² of 0.29. Higher income is positively correlated with longer life expectancy.
    2. Adult Mortality: Increased adult mortality significantly reduced life expectancy. For every unit increase in adult mortality, life expectancy decreased by 0.042 years.
    3. Schooling: More years of schooling was strongly correlated with longer life expectancy, reflecting the importance of education in enhancing health outcomes.
    4. HIV/AIDS: Countries with higher HIV/AIDS prevalence had lower life expectancy, with significant negative coefficients for all levels of prevalence.
    5. Country Status: Developed countries had significantly higher life expectancy than developing countries, with an average difference of about 1.52 years.
    6. BMI: While underweight and obese categories were significant predictors, the relationship between BMI and life expectancy was complex, suggesting that high-income countries might offset health risks through medical care.
    7. Alcohol Consumption: Contrary to initial expectations, alcohol consumption did not have a statistically significant effect on life expectancy in this model.
