16 datasets found
  1. Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    ZANG, LEIZHEN; Feng XIONG (2023). Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results [Dataset]. http://doi.org/10.7910/DVN/GQHURF
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    ZANG, LEIZHEN; Feng XIONG
    Description

    Missing data are a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data, using Internet and public service data as test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on public services, but also showed that machine-learning imputation outperformed random and multiple imputation, greatly improving the model's explanatory power. Because the panel data imputed by machine learning have better continuity in the time trend, they can also be analyzed with a dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms underlying the empirical results are discussed.

  2. New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Nov 14, 2025
    + more versions
    Cite
    National Institute of Justice (2025). New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 [Dataset]. https://catalog.data.gov/dataset/new-approach-to-evaluating-supplementary-homicide-report-shr-data-imputation-1990-1995-ff769
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    National Institute of Justice
    Description

    The purpose of the project was to learn more about patterns of homicide in the United States by strengthening the ability to make imputations for Supplementary Homicide Report (SHR) data with missing values. Supplementary Homicide Reports (SHR) and local police data from Chicago, Illinois, St. Louis, Missouri, Philadelphia, Pennsylvania, and Phoenix, Arizona, for 1990 to 1995 were merged to create a master file by linking on overlapping information on victim and incident characteristics. Through this process, 96 percent of the cases in the SHR were matched with cases in the police files. The data contain variables for three types of cases: complete in SHR, missing offender and incident information in SHR but known in police report, and missing offender and incident information in both. The merged file allows estimation of similarities and differences between the cases with known offender characteristics in the SHR and those in the other two categories. The accuracy of existing data imputation methods can be assessed by comparing imputed values in an "incomplete" dataset (the SHR), generated by the three imputation strategies discussed in the literature, with the actual values in a known "complete" dataset (combined SHR and police data). Variables from both the Supplemental Homicide Reports and the additional police report offense data include incident date, victim characteristics, offender characteristics, incident details, geographic information, as well as variables regarding the matching procedure.

  3. Retail Product Dataset with Missing Values

    • kaggle.com
    zip
    Updated Feb 17, 2025
    Cite
    Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
    Available download formats: zip (47826 bytes)
    Dataset updated
    Feb 17, 2025
    Authors
    Himel Sarder
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

    The dataset includes:
    - Category (Categorical): Product category (A, B, C, D)
    - Price (Numerical): Randomized product prices
    - Rating (Numerical): Ratings between 1 to 5
    - Stock (Categorical): Availability status (In Stock, Out of Stock)
    - Discount (Numerical): Discount percentage

    This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
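    As a starting point for the imputation practice this dataset targets, here is a minimal pandas sketch; the column names follow the description above, but the values are invented for illustration:

    ```python
    import pandas as pd

    # Toy frame with the documented columns; the values are made up.
    df = pd.DataFrame({
        "Category": ["A", None, "C", "A"],
        "Price": [10.0, None, 30.0, 20.0],
        "Rating": [4.0, 2.0, None, 5.0],
        "Stock": ["In Stock", "Out of Stock", None, "In Stock"],
        "Discount": [5.0, None, 15.0, 10.0],
    })

    # A simple baseline: median for numeric columns, mode for categorical ones.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    ```

    More sophisticated strategies (KNN or iterative imputation) can then be benchmarked against this baseline.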

  4. Descriptive statistics for original data and complete data after imputation in Experiment 2

    • plos.figshare.com
    xls
    Updated Nov 29, 2023
    Cite
    Maria Chiara Liverani; Eleni Kalogirou; Catherine Rivier; Edouard Gentaz (2023). Descriptive statistics for original data and complete data after imputation in Experiment 2. [Dataset]. http://doi.org/10.1371/journal.pone.0289027.t004
    Available download formats: xls
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Maria Chiara Liverani; Eleni Kalogirou; Catherine Rivier; Edouard Gentaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Descriptive statistics for original data and complete data after imputation in Experiment 2.

  5. Smartphones Dataset (August 2024)

    • kaggle.com
    zip
    Updated Aug 24, 2024
    Cite
    Dilkush Singh (2024). Smartphones Dataset (August 2024) [Dataset]. https://www.kaggle.com/datasets/dilkushsingh/smartphones-dataset-upto-july24
    Available download formats: zip (605033 bytes)
    Dataset updated
    Aug 24, 2024
    Authors
    Dilkush Singh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Smartphones Dataset (August 2024)

    This dataset contains information on the latest smartphones as of July 2024, gathered through web scraping using Selenium and Beautiful Soup. The dataset is available in four different versions, reflecting the stages of data cleaning and processing.
    - If you want to know about the web scraping process, read the blog: Medium Article
    - If you want to see the step-by-step process of data cleaning and EDA, check out my GitHub repo: GitHub Repo

    Dataset Versions:

    Version 1: Raw Data (smartphones.csv or smartphones_uncleaned.csv - same files)

    This version contains the fully uncleaned data as it was initially scraped from the web. It includes all the raw information, with inconsistencies, missing values, and potential duplicates. Purpose: Serves as the baseline dataset for understanding the initial state of the data before any cleaning or processing.

    Version 2: Basic Cleaning (smartphones_cleaned_v1.csv)

    Basic cleaning operations have been applied. This includes removing duplicates, handling missing values, and standardizing the formats of certain fields (e.g., dates, numerical values). Purpose: Provides a cleaner and more consistent dataset, making it easier for basic analysis.

    Version 3: Intermediate Cleaning (smartphones_cleaned_v2.csv)

    Additional data cleaning techniques have been implemented. This version addresses more complex issues such as outlier detection and correction, normalization of categorical data, and initial feature engineering. Purpose: Offers a more refined dataset suitable for exploratory data analysis (EDA) and more in-depth statistical analyses.

    Version 4: Fully Cleaned and Processed Data (smartphones_cleaned_v3.csv)

    This version represents the final, fully cleaned dataset. Advanced cleaning techniques have been applied, including imputation of missing data, removal of irrelevant features, and final feature engineering. Purpose: Ideal for machine learning model training and other advanced analytics.
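    The progression from raw scrape to basic cleaning described above can be sketched in pandas; the rows, the ₹-formatted price strings, and the column names here are invented for illustration, not taken from the actual files:

    ```python
    import pandas as pd

    # Hypothetical raw scrape: a duplicated row, price stored as text, a missing rating.
    raw = pd.DataFrame({
        "model": ["Phone A", "Phone A", "Phone B", "Phone C"],
        "price": ["₹19,999", "₹19,999", "₹24,999", "₹9,999"],
        "rating": ["4.5", "4.5", None, "4.1"],
    })

    clean = raw.drop_duplicates().copy()           # remove duplicates
    clean["price"] = (clean["price"]               # standardize numeric formats
                      .str.replace("₹", "", regex=False)
                      .str.replace(",", "", regex=False)
                      .astype(float))
    clean["rating"] = pd.to_numeric(clean["rating"], errors="coerce")
    clean["rating"] = clean["rating"].fillna(clean["rating"].median())  # handle missing
    ```

    Later versions would layer outlier handling and feature engineering on top of a pipeline like this.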

  6. Data from: Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment

    • tandf.figshare.com
    zip
    Updated Aug 29, 2025
    Cite
    Siyu Heng; Jiawei Zhang; Yang Feng (2025). Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment [Dataset]. http://doi.org/10.6084/m9.figshare.29356876.v1
    Available download formats: zip
    Dataset updated
    Aug 29, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Siyu Heng; Jiawei Zhang; Yang Feng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Design-based causal inference, also known as randomization-based or finite-population causal inference, is one of the most widely used causal inference frameworks, largely due to the merit that its validity can be guaranteed by study design (e.g., randomized experiments) and does not require assuming specific outcome-generating distributions or super-population models. Despite its advantages, design-based causal inference can still suffer from other issues, among which outcome missingness is a prevalent and significant challenge. This work systematically studies the outcome missingness problem in design-based causal inference. First, we propose a general and flexible outcome missingness mechanism that can facilitate finite-population-exact randomization tests of no treatment effect. Second, under this general missingness mechanism, we propose a general framework called “imputation and re-imputation” for conducting randomization tests in design-based causal inference with missing outcomes. We prove that our framework can still ensure finite-population-exact Type-I error rate control even when the imputation model was misspecified or when unobserved covariates or interference exist in the missingness mechanism. Third, we extend our framework to conduct covariate adjustment in randomization tests and construct finite-population-valid confidence regions with missing outcomes. Our framework is evaluated via extensive simulation studies and applied to a large-scale randomized experiment. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
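    A minimal numpy sketch of the "imputation and re-imputation" idea: impute missing outcomes, compute a test statistic, then for each re-randomized treatment assignment re-impute before recomputing. Arm-mean imputation and a difference in means stand in for the paper's general imputation model and test statistic, and the data are simulated under the null; this is an illustration of the mechanics, not the authors' implementation:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy finite population: binary treatment Z, outcomes Y with missingness (NaN).
    n = 40
    Z = np.repeat([1, 0], n // 2)
    Y = rng.normal(0, 1, n)                             # no treatment effect
    Y[rng.choice(n, size=8, replace=False)] = np.nan    # outcome missingness

    def impute_then_stat(Y, Z):
        """Impute missing outcomes by the observed arm mean, then return the
        difference in arm means (a simple stand-in for the general framework)."""
        Yi = Y.copy()
        for z in (0, 1):
            arm = Z == z
            Yi[arm & np.isnan(Yi)] = np.nanmean(Yi[arm])
        return Yi[Z == 1].mean() - Yi[Z == 0].mean()

    obs = impute_then_stat(Y, Z)
    # Re-randomize treatment and RE-impute each time ("imputation and re-imputation").
    perm = np.array([impute_then_stat(Y, rng.permutation(Z)) for _ in range(999)])
    p_value = (1 + np.sum(np.abs(perm) >= np.abs(obs))) / (1 + len(perm))
    ```

    Because the imputation step depends on the treatment assignment, it must be rerun inside the permutation loop; that is the point of re-imputation.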

  7. A Hybrid Educational Dataset

    • kaggle.com
    Updated Jun 27, 2025
    Cite
    Emanoel Carvalho Lopes (2025). A Hybrid Educational Dataset [Dataset]. https://www.kaggle.com/datasets/emanoelcarvalholopes/uci-oulad-sintetico-unificados
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Emanoel Carvalho Lopes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    The early identification of students facing learning difficulties is one of the most critical challenges in modern education. Intervening effectively requires leveraging data to understand the complex interplay between student demographics, engagement patterns, and academic performance.

    This dataset was created to serve as a high-quality, pre-processed resource for building machine learning models to tackle this very problem. It is a unique hybrid dataset, meticulously crafted by unifying three distinct sources:

    The Open University Learning Analytics Dataset (OULAD): A rich dataset detailing student interactions with a Virtual Learning Environment (VLE). We have aggregated the raw, granular data (over 10 million interaction logs) into powerful features, such as total clicks, average assessment scores, and distinct days of activity for each student registration.

    The UCI Student Performance Dataset: A classic educational dataset containing demographic information and final grades in Portuguese and Math subjects from two Portuguese schools.

    A Synthetic Data Component: A synthetically generated portion of the data, created to balance the dataset or represent specific student profiles.

    Data Unification and Pre-processing

    A direct merge of these sources was not possible as the student identifiers were not shared. Instead, a strategy of intelligent concatenation was employed. The final dataset has undergone a rigorous pre-processing pipeline to make it immediately usable for machine learning tasks:

    • Advanced Imputation: Missing values were handled using a sophisticated iterative imputation method powered by Gaussian Mixture Models (GMM), ensuring the dataset's integrity.

    • One-Hot Encoding: All categorical features have been converted to a numerical format.

    • Feature Scaling: All numerical features have been standardized (using StandardScaler) to have a mean of 0 and a standard deviation of 1, preventing model bias from features with different scales.

    The result is a clean, comprehensive dataset ready for modeling.
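    scikit-learn does not ship a GMM-powered iterative imputer, but its MICE-style IterativeImputer followed by StandardScaler gives a rough sketch of the imputation-then-scaling steps described above (the matrix here is a toy with invented values, not the actual dataset):

    ```python
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.preprocessing import StandardScaler

    # Toy numeric matrix with missing entries.
    X = np.array([[1.0, 2.0],
                  [2.0, np.nan],
                  [3.0, 6.0],
                  [np.nan, 8.0]])

    # Iteratively impute each feature from the others, then standardize
    # to mean 0 and standard deviation 1, as in the pipeline described.
    X_imp = IterativeImputer(random_state=0).fit_transform(X)
    X_std = StandardScaler().fit_transform(X_imp)
    ```

    One-hot encoding of categoricals (e.g., via pandas.get_dummies) would happen before the scaling step, as in the published pipeline.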

    File Information

    Instance

    Each row represents a student profile, and the columns are the features and the target.

    Feature

    Features include aggregated online engagement metrics (e.g., clicks, distinct activities), academic performance (grades, scores), and student demographics (e.g., gender, age band). A key feature indicates the original data source (OULAD, UCI, Synthetic).

    Sensitive Information

    The dataset contains no Personally Identifiable Information (PII). Demographic information is presented in broad, anonymized categories.

    Key Columns:

    Target Variable:
    
      had_difficulty: The primary target for classification. This binary variable has been engineered from the original final_result column of the OULAD dataset.
    
        1: The student either failed (Fail) or withdrew (Withdrawn) from the course.
    
        0: The student passed (Pass or Distinction).
    
    Feature Groups:
    
      OULAD Aggregated Features (e.g., oulad_total_cliques, oulad_media_notas): Quantitative metrics summarizing a student's engagement and performance within the VLE.
    
      Academic Performance Features (e.g., nota_matematica_harmonizada): Harmonized grades from different data sources.
    
      Demographic Features (e.g., gender_*, age_band_*): One-hot encoded columns representing student demographics.
    
      Origin Features (e.g., origem_dado_OULAD, origem_dado_UCI): One-hot encoded columns indicating the original source of the data for each row. This allows for source-specific analysis.
    

    (Note: All numerical feature names are post-scaling and may not directly reflect their original names. Please refer to the complete column list for details.)

    Acknowledgements

    This dataset would not be possible without the original data providers. Please acknowledge them in any work that uses this data:

    OULAD Dataset: Kuzilek, J., Hlosta, M., and Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4. https://analyse.kmi.open.ac.uk/open_dataset
    
    UCI Student Performance Dataset: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS. https://archive.ics.uci.edu/ml/datasets/student+performance
    

    Inspiration

    This dataset is perfect for a variety of predictive modeling tasks. Here are a few ideas to get you started:

    • Can you build a classification model to predict had_difficulty with high recall? (Minimizing the number of at-risk students we fail to identify.)
    
    • Which features are the most powerful predictors of student failure or withdrawal? (Feature Importance Analysis).

    • Can you build separate models for each data origin (origem_dado_*) and compare ...

  8. Factors associated with each domain of burnout among resident physicians (multiple imputation, n = 296)

    • plos.figshare.com
    xls
    Updated Oct 30, 2024
    Cite
    Vithawat Surawattanasakul; Penprapa Siviroj; Wuttipat Kiratipaisarl (2024). Factors associated with each domain of burnout among resident physicians (multiple imputation, n = 296). [Dataset]. http://doi.org/10.1371/journal.pone.0312839.t003
    Available download formats: xls
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Vithawat Surawattanasakul; Penprapa Siviroj; Wuttipat Kiratipaisarl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Factors associated with each domain of burnout among resident physicians (multiple imputation, n = 296).

  9. 📊 Telco Customer Churn Dataset

    • kaggle.com
    zip
    Updated Jul 18, 2025
    Cite
    Austin Kleon (2025). 📊 Telco Customer Churn Dataset [Dataset]. https://www.kaggle.com/datasets/jethwaaatmik/telco-customer-churn-dataset
    Available download formats: zip (172687 bytes)
    Dataset updated
    Jul 18, 2025
    Authors
    Austin Kleon
    Description

    📝 Dataset Description This dataset contains information about customers of a telecommunications company, including their demographic details, account information, service subscriptions, and churn status. It is a modified version of the popular Telco Churn dataset, curated for exploratory data analysis, machine learning model development, and churn prediction tasks.

    The dataset includes simulated missing values in some columns to reflect real-world data issues and support preprocessing and imputation tasks. This makes it especially useful for demonstrating data cleaning techniques and evaluating model robustness.

    📂 Files Included telco_data_modified.csv: The main dataset with 21 columns and 7043 rows (some missing values are intentionally inserted).

    📌 Features
    - customerID: Unique identifier for each customer
    - gender: Customer gender (Male/Female)
    - SeniorCitizen: Indicates if the customer is a senior citizen (0 = No, 1 = Yes)
    - Partner: Whether the customer has a partner
    - Dependents: Whether the customer has dependents
    - tenure: Number of months the customer has stayed with the company
    - PhoneService: Whether the customer has phone service
    - MultipleLines: Whether the customer has multiple lines
    - InternetService: Customer's internet service provider (DSL, Fiber optic, No)
    - OnlineSecurity: Whether the customer has online security
    - OnlineBackup: Whether the customer has online backup
    - DeviceProtection: Whether the customer has device protection
    - TechSupport: Whether the customer has tech support
    - StreamingTV: Whether the customer has streaming TV
    - StreamingMovies: Whether the customer has streaming movies
    - Contract: Type of contract (Month-to-month, One year, Two year)
    - PaperlessBilling: Whether the customer uses paperless billing
    - PaymentMethod: Payment method (e.g., Electronic check, Mailed check)
    - MonthlyCharges: Monthly charges
    - TotalCharges: Total charges to date
    - Churn: Whether the customer has left the company (Yes/No)

    🔍 Use Cases Binary classification: Predict customer churn

    Data preprocessing and imputation exercises

    Feature engineering and importance analysis

    Customer segmentation and churn modeling

    ⚠️ Notes Missing values were intentionally inserted in the dataset to help simulate real-world conditions.

    Some preprocessing may be required before modeling (e.g., converting categorical to numerical data, handling TotalCharges as numeric).
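    For example, handling TotalCharges as numeric might look like this in pandas; the blank-string rows and the fill-with-zero choice are illustrative assumptions (in the original Telco data, blanks correspond to tenure-0 customers):

    ```python
    import pandas as pd

    # The raw file stores TotalCharges as text, with blanks for new customers.
    df = pd.DataFrame({"tenure": [0, 12, 24],
                       "TotalCharges": [" ", "840.5", "1690.0"]})

    # Coerce to numeric: unparseable blanks become NaN.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # One common choice: a blank TotalCharges at tenure 0 really means 0.
    df["TotalCharges"] = df["TotalCharges"].fillna(0.0)
    ```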

    🏷️ Tags

    #telecom #churn #classification #customer-analytics #data-cleaning #feature-engineering

    🙏 Acknowledgements This dataset is based on the original Telco Customer Churn dataset (initially provided by IBM). The current version has been modified for academic and practical exercises.

  10. Supplementary file 2_Machine learning enables early risk stratification of hymenopteran stings: evidence from a tropical multicenter cohort

    • figshare.com
    xlsx
    Updated Oct 28, 2025
    + more versions
    Cite
    Feng Han; Yuanshui Liu; Huamei Li; Xiaofang Chen; Liqiu Liang; Dongchuan Xu; Lijiao Ye; Yanhong Ouyang; Ping He; Wang Liao (2025). Supplementary file 2_Machine learning enables early risk stratification of hymenopteran stings: evidence from a tropical multicenter cohort.xlsx [Dataset]. http://doi.org/10.3389/fpubh.2025.1664606.s004
    Available download formats: xlsx
    Dataset updated
    Oct 28, 2025
    Dataset provided by
    Frontiers
    Authors
    Feng Han; Yuanshui Liu; Huamei Li; Xiaofang Chen; Liqiu Liang; Dongchuan Xu; Lijiao Ye; Yanhong Ouyang; Ping He; Wang Liao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Hymenopteran stings (from bees, wasps, and hornets) can trigger severe systemic reactions, especially in tropical regions, risking patient safety and emergency care efficiency. Accurate early risk stratification is essential to guide timely intervention.

    Objective: To develop and validate an interpretable machine learning model for early prediction of severe outcomes following hymenopteran stings.

    Methods: We retrospectively analyzed 942 cases from a multicenter cohort in Hainan Province, China. Questionnaires with >20% missing data were excluded. Mean substitution was applied for primary missing data imputation, with multiple imputation by chained equations (MICE) used for sensitivity analysis. Seven supervised classifiers were trained using five-fold cross-validation; class imbalance was addressed using the adaptive synthetic sampling (ADASYN) algorithm. Model performance was evaluated via area under the receiver operating characteristic curve (AUC), recall, and precision, and feature importance was interpreted using Shapley additive explanations (SHAP) values.

    Results: Among 942 patients, 8.7% developed severe systemic complications. The distribution by species was: wasps (25.5%), honey bees (8.9%), and unknown species (65.6%). The optimal Extra Trees model achieved an AUC of 0.982, recall of 0.956, and precision of 0.926 in the held-out validation set. Key predictors included hypotension, dyspnea, altered mental status, elevated leukocyte counts, and abnormal creatinine levels. A web-based risk calculator was deployed for bedside application. Given the small number of high-risk cases, these high AUC values may overestimate real-world performance and require external validation.

    Conclusion: We developed an interpretable, deployable tool for early triage of hymenopteran sting patients in tropical settings. Emergency integration may improve clinical decisions and outcomes.
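    A small pandas sketch of the primary/sensitivity imputation contrast in the Methods, using invented lab values and median substitution as a stand-in for MICE (which needs a dedicated implementation such as scikit-learn's IterativeImputer):

    ```python
    import numpy as np
    import pandas as pd

    # Toy lab values with missingness; the real cohort had 942 patients.
    df = pd.DataFrame({"wbc": [6.2, np.nan, 11.5, 9.8, np.nan, 7.1],
                       "creatinine": [0.9, 1.4, np.nan, 2.1, 1.0, np.nan]})

    # Primary analysis: mean substitution, as described above.
    primary = df.fillna(df.mean())

    # Sensitivity stand-in: a second imputation strategy (median here, MICE in the paper).
    sensitivity = df.fillna(df.median())

    # Check how much downstream summaries move between the two imputations.
    shift = (primary.mean() - sensitivity.mean()).abs()
    ```

    If the downstream estimates barely move between the two imputations, the conclusions are robust to the imputation choice; that is the point of the sensitivity analysis.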

  11. US traffic data with weather and calendar dataset

    • kaggle.com
    zip
    Updated Aug 25, 2023
    Cite
    Maryam Shoaei (2023). US traffic data with weather and calendar dataset [Dataset]. https://www.kaggle.com/datasets/maryamshoaei/us-traffic-data-with-weather-and-calendar-dataset/data
    Available download formats: zip (367139 bytes)
    Dataset updated
    Aug 25, 2023
    Authors
    Maryam Shoaei
    Area covered
    United States
    Description

    This dataset was collected for a research paper titled "Twitter-informed Prediction for Urban Traffic Flow Using Machine Learning," which is available online at https://ieeexplore.ieee.org/document/10185516. If you intend to use this dataset, we kindly request that you consider acknowledging our paper by including a citation. Your support in referencing our work would be greatly appreciated.

    The traffic dataset was obtained through the California Performance Measurement System (PeMS) in the United States. It encompasses traffic data, including speed and flow information, for the eastbound lanes of the Ventura Highway in Los Angeles, covering the period from February 1 to May 31, 2020.

    Calendar features in this dataset consist of weekdays, represented as numbers from 1 to 7, and a binary variable indicating whether a specific day is a holiday. Weather data was sourced from the Wunderground website (accessible at https://www.wunderground.com/history/daily/KLAX) throughout the study period. Weather data includes hourly observations of various meteorological factors. For consistency, we assume that weather conditions remain constant during each 5-minute time interval within an hour.

    Weather conditions in the dataset include categories such as fair, blowing dust, cloudy, cloudy/windy, fair/windy, fog, haze, heavy rain, light rain, mostly cloudy, mostly cloudy/windy, partly cloudy/windy, rain, and thunder in the vicinity. Temperature is measured in Fahrenheit.

    Missing data in this context refers to temporary disruptions in the availability of traffic information within specific areas of the transportation network due to sensor failures or noisy data. To address these missing values, we employed the mean imputation method.
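    A minimal pandas sketch of mean imputation on a 5-minute flow series (the values are invented), alongside linear interpolation as a time-aware alternative one might compare against:

    ```python
    import numpy as np
    import pandas as pd

    # 5-minute traffic flow with sensor dropouts (NaN), as in the PeMS data.
    flow = pd.Series([310.0, 295.0, np.nan, np.nan, 280.0, 300.0])

    mean_filled = flow.fillna(flow.mean())   # the approach used in the paper
    interp_filled = flow.interpolate()       # a time-aware alternative
    ```

    Mean imputation fills every gap with one global value, while interpolation respects the local time trend; for short sensor outages the two can differ noticeably.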

  12. MedSec-25: IoMT Cybersecurity Dataset

    • kaggle.com
    zip
    Updated Sep 8, 2025
    Cite
    Muhammad Abdullah (2025). MedSec-25: IoMT Cybersecurity Dataset [Dataset]. https://www.kaggle.com/datasets/abdullah001234/medsec-25-iomt-cybersecurity-dataset
    Available download formats: zip (38496221 bytes)
    Dataset updated
    Sep 8, 2025
    Authors
    Muhammad Abdullah
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Overview

    MedSec-25 is a comprehensive, labeled network traffic dataset designed specifically for the Internet of Medical Things (IoMT) in healthcare environments. It addresses the limitations of existing generic IoT datasets by capturing realistic traffic from a custom-built healthcare IoT lab that mimics real-world hospital operations. The dataset includes both benign (normal) traffic and malicious traffic from multi-staged attack campaigns inspired by the MITRE ATT&CK framework. This allows for the development and evaluation of machine learning-based intrusion detection systems (IDS) tailored to IoMT scenarios, where patient safety and data privacy are critical. The dataset was generated using a variety of medical sensors (e.g., ECG, EEG, HHI, Respiration, SpO2) and environmental sensors (e.g., thermistor, ultrasonic, PIR, flame) connected via Raspberry Pi nodes and an IoT server. Traffic was captured over 7.5 hours using tools like Wireshark and tcpdump, resulting in PCAPNG files. These were processed with CICFlowMeter to extract flow-based features, producing a cleaned CSV dataset with 554,534 bidirectional network flows and 84 features.

    Key Highlights:

    Realistic Setup: Built in a physical lab at Rochester Institute of Technology, Dubai, incorporating diverse IoMT devices, protocols (e.g., MQTT, SSH, Telnet, FTP, HTTP, DNS), and real-time patient interactions (anonymized to comply with privacy regulations like HIPAA).

    Multi-Staged Attacks: Unlike datasets focusing on isolated attacks, MedSec-25 simulates full attack chains: Reconnaissance (e.g., SYN/TCP scans, OS fingerprinting), Initial Access (e.g., brute-force, malformed MQTT packets), Lateral Movement (e.g., exploiting vulnerabilities to pivot between devices), and Exfiltration (e.g., data theft via MQTT).

    Imbalanced Nature: This is the cleaned (imbalanced) version of the dataset. Users may need to apply balancing techniques (e.g., SMOTE oversampling + random undersampling) for model training, as demonstrated in the associated paper.

    Size and Quality: 554,534 rows, no duplicates, and almost no missing values (only 111 NaNs in Flow Byts/s, ~0.02%, which can be handled via imputation). Data types include float64 (45 columns), int64 (34 columns), and object (5 columns: Flow ID, Src IP, Dst IP, Timestamp, Label).

    Utility: Preliminary models trained on this dataset (e.g., KNN: 98.09% accuracy, Decision Tree: 98.35% accuracy) show excellent performance for detecting attack stages.

    This dataset is ideal for researchers in cybersecurity, machine learning, and healthcare IoT, enabling the creation of an IDS that can detect attacks at different phases to prevent escalation.

    Data Collection

    Benign Traffic: Generated over two days with active sensors, services (HTTP dashboard for patient monitoring, SSH/Telnet for remote access, FTP for file transfers), and real users (students/faculty) interacting with medical devices. No personally identifiable information was stored.

    Malicious Traffic: Two Kali Linux attacker machines simulated MITRE ATT&CK-inspired campaigns using tools like Nmap, Scapy, Metasploit, and custom Python scripts.

    Capture Tools: Wireshark and tcpdump for PCAPNG files (total ~1GB: 600MB benign, 400MB malicious).

    Processing: Combined PCAP files per label, extracted features with CICFlowMeter, labeled flows manually based on attack phases, and cleaned for ML readiness. The final cleaned CSV is ~350MB.

    Features

    The dataset includes 84 features extracted by CICFlowMeter, categorized as:

    Identifiers: Flow ID, Src IP, Src Port, Dst IP, Dst Port, Protocol, Timestamp.

    Time-Series Metrics: Flow Duration, Flow IAT Mean/Std/Max/Min, Fwd/Bwd IAT Tot/Mean/Std/Max/Min.

    Size/Count Statistics: Tot Fwd/Bwd Pkts, TotLen Fwd/Bwd Pkts, Fwd/Bwd Pkt Len Max/Min/Mean/Std, Pkt Len Min/Max/Mean/Std/Var, Pkt Size Avg.

    Flag Counts: Fwd/Bwd PSH/URG Flags, FIN/SYN/RST/PSH/ACK/URG/CWE/ECE Flag Cnt.

    Rates and Ratios: Flow Byts/s, Flow Pkts/s, Fwd/Bwd Pkts/s, Down/Up Ratio, Active/Idle Mean/Std/Max/Min.

    Segmentation and Others: Fwd/Bwd Seg Size Avg/Min, Subflow Fwd/Bwd Pkts/Byts, Init Fwd/Bwd Win Byts, Fwd Act Data Pkts, Fwd/Bwd Byts/b Avg, Fwd/Bwd Pkts/b Avg, Fwd/Bwd Blk Rate Avg.

    Labels

    The dataset is labeled with 5 classes representing benign behavior and attack stages:

    - Reconnaissance: 401,683 flows
    - Initial Access: 102,090 flows
    - Exfiltration: 25,915 flows
    - Lateral Movement: 12,498 flows
    - Benign: 12,348 flows

    Note: The dataset is imbalanced, with Reconnaissance dominating. Apply balancing techniques for optimal ML performance.

    Usage

    Preprocessing Suggestions: Encode categorical features (e.g., Protocol, Label) using LabelEncoder. Normalize numerical features with Min-Max Scaler or StandardScaler. Handle the minor NaNs in Flow Byts/s via mean imputation.
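    The suggestions above can be sketched on a toy frame as follows; the column names (Protocol, Flow Byts/s, Label) follow the dataset schema, while the three rows of values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Three toy flows standing in for the cleaned CSV; values are illustrative.
df = pd.DataFrame({
    "Protocol": ["TCP", "UDP", "TCP"],
    "Flow Byts/s": [1200.0, np.nan, 800.0],
    "Label": ["Benign", "Reconnaissance", "Benign"],
})

# Mean-impute the sparse NaNs in Flow Byts/s.
df["Flow Byts/s"] = df["Flow Byts/s"].fillna(df["Flow Byts/s"].mean())

# Encode the categorical columns, then scale numerics to [0, 1].
for col in ("Protocol", "Label"):
    df[col] = LabelEncoder().fit_transform(df[col])
df[["Flow Byts/s"]] = MinMaxScaler().fit_transform(df[["Flow Byts/s"]])
```

    In a real split, fit the scaler on the training partition only and reuse it on the test partition to avoid leakage.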

    Model Training: Split into train/test (e.g., 80/20). Suitable for classification tasks w...

  13. Hotel Reviews Dataset

    • kaggle.com
    zip
    Updated Aug 27, 2024
    Cite
    Waseem AlAstal (2024). Hotel Reviews Dataset [Dataset]. https://www.kaggle.com/datasets/waseemalastal/hotel-reviews-dataset
    Explore at:
    zip (3410051 bytes)
    Dataset updated
    Aug 27, 2024
    Authors
    Waseem AlAstal
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context and Sources: This dataset comprises 10,000 hotel reviews collected from various online sources, including platforms like Hotels.com and TripAdvisor. Each entry contains detailed information about the review, the reviewer, and the hotel, providing valuable insights into customer satisfaction and preferences.

    Inspiration: This dataset was created to facilitate the analysis of customer reviews in the hospitality industry. It can be used to study customer sentiments, identify trends, and improve hotel services by understanding the key factors that contribute to customer satisfaction.

    Use Cases:

    Sentiment Analysis: Analyze the sentiment of reviews to determine customer satisfaction and identify areas for improvement.
    Trend Analysis: Identify common themes and trends in customer feedback over time.
    Recommender Systems: Use the data to build systems that suggest hotels based on user preferences and review patterns.
    Market Research: Understand customer preferences and competitive positioning within the hotel industry.

    Dataset Overview:

    Number of Rows: 10,000
    Number of Columns: 25

    Key Columns:

    reviews.text: The text of the review, offering qualitative insights into customer experiences.
    reviews.rating: The rating given by the reviewer, typically on a scale from 1 to 5.
    city, country: Geographical location of the hotel, enabling region-specific analysis.
    reviews.username: The username of the reviewer, which can be used to study review patterns and behaviors.
    reviews.date: The date the review was written, useful for temporal analysis.

    Potential Challenges:

    Missing Data: Some columns, such as reviews.userCity and reviews.userProvince, have missing values, which may require imputation or exclusion during analysis.
    Data Imbalance: The distribution of ratings might be skewed, which could affect sentiment analysis or other predictive modeling tasks.

    This dataset is well-suited for various applications in natural language processing, machine learning, and data analysis within the hospitality industry.
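    As a quick illustration of sizing up those challenges before modeling, the sketch below measures missingness and rating skew on an invented toy frame that borrows the dataset's column names.

```python
import pandas as pd

# Toy stand-in for the reviews file; column names follow the dataset schema,
# the six rows of values are invented.
reviews = pd.DataFrame({
    "reviews.rating": [5, 4, 5, 1, 5, 3],
    "reviews.userCity": ["Rome", None, None, "Oslo", None, "Kyiv"],
})

# Quantify missingness before choosing between imputation and exclusion.
missing_share = reviews["reviews.userCity"].isna().mean()

# Check how skewed the rating distribution is before sentiment modeling.
top_rating_share = (reviews["reviews.rating"] == 5).mean()
```

    A high missing share argues for dropping the column outright; a dominant top rating argues for class weighting or resampling in any downstream classifier.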

  14. Customer_Financial_Data

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    Prashob Narendran (2025). Customer_Financial_Data [Dataset]. https://www.kaggle.com/datasets/prashobnarendran/customer-financial-data
    Explore at:
    zip (62099 bytes)
    Dataset updated
    Nov 12, 2025
    Authors
    Prashob Narendran
    Description

    Context This dataset contains detailed, anonymized information about a bank's customers. It includes demographic data such as age, income, and family size, as well as financial information like mortgage value, credit card ownership, and average spending habits. The data is well-suited for a variety of machine learning tasks, particularly in the domain of financial services and marketing.

    Content The dataset consists of 5000 customer records with 14 attributes:

    • Customer_ID: A unique identifier for each customer.
    • Age: The customer's age in completed years.
    • Years_Experience: Years of professional experience.
    • Annual_Income: Annual income of the customer (in thousands of dollars).
    • ZIP_Code: The customer's home address ZIP code.
    • Family_size: The number of individuals in the customer's family.
    • Avg_Spending: Average monthly spending on credit cards (in thousands of dollars).
    • Education_Level: A categorical variable for education level (1: Undergraduate, 2: Graduate, 3: Advanced/Professional).
    • Mortgage: The value of the customer's house mortgage if any (in thousands of dollars).
    • Has_Consumer_Loan: Binary variable indicating if the customer accepted a personal loan in the last campaign (1: Yes, 0: No). This is a potential target variable.
    • Has_Securities_Account: Binary variable indicating if the customer has a securities account with the bank.
    • Has_CD_Account: Binary variable indicating if the customer has a certificate of deposit (CD) account with the bank.
    • Uses_Online_Banking: Binary variable indicating if the customer uses online banking services.
    • Has_CreditCard: Binary variable indicating if the customer uses a credit card issued by this bank.

    Data Quality Note Some rows contain negative values in the Years_Experience column. This is a data quality issue that may require preprocessing, e.g., taking the absolute value, or imputing with the average experience of similar age groups.
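    Both suggested fixes can be sketched in pandas on a toy frame. The Age and Years_Experience columns follow the dataset; the four rows of values and the helper column names (Exp_abs, Exp_imputed) are invented for illustration.

```python
import pandas as pd

# Toy records mirroring the described quality issue.
df = pd.DataFrame({
    "Age": [25, 30, 30, 45],
    "Years_Experience": [2, -3, 5, 20],
})

# Option 1: treat negatives as sign errors and take the absolute value.
df["Exp_abs"] = df["Years_Experience"].abs()

# Option 2: treat negatives as invalid, then impute the mean of the
# remaining valid values within the same age group.
valid = df["Years_Experience"].mask(df["Years_Experience"] < 0)
df["Exp_imputed"] = valid.fillna(valid.groupby(df["Age"]).transform("mean"))
```

    Option 2 is the safer default when the negative sign could mean anything (entry error, sentinel value) rather than a simple sign flip.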

    Potential Use Cases This dataset is excellent for both educational and practical purposes. You can use it to:

    1. Predict Loan Acceptance: Build a classification model to predict which customers are most likely to accept a personal loan (Has_Consumer_Loan).
    2. Customer Segmentation: Use clustering algorithms (like K-Means) to identify distinct customer segments for targeted marketing campaigns.
    3. Credit Card Adoption: Analyze the factors that influence a customer's decision to get a bank-issued credit card.
    4. Exploratory Data Analysis (EDA): Practice your data analysis and visualization skills to uncover insights about customer behavior.
  15. Table1_A personalized prediction model for urinary tract infections in type...

    • frontiersin.figshare.com
    docx
    Updated Jan 5, 2024
    Cite
    Yu Xiong; Yu-Meng Liu; Jia-Qiang Hu; Bao-Qiang Zhu; Yuan-Kui Wei; Yan Yang; Xing-Wei Wu; En-Wu Long (2024). Table1_A personalized prediction model for urinary tract infections in type 2 diabetes mellitus using machine learning.DOCX [Dataset]. http://doi.org/10.3389/fphar.2023.1259596.s001
    Explore at:
    docx
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Yu Xiong; Yu-Meng Liu; Jia-Qiang Hu; Bao-Qiang Zhu; Yuan-Kui Wei; Yan Yang; Xing-Wei Wu; En-Wu Long
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Patients with type 2 diabetes mellitus (T2DM) are at higher risk for urinary tract infections (UTIs), which greatly impact their quality of life. Developing a risk prediction model to identify patients at high risk of UTIs among those with T2DM, and using it to assist clinical decision-making, can help reduce the incidence of UTIs in T2DM patients.

    To construct the predictive model, potentially relevant variables were first selected from the reference literature, and data were then extracted from the Hospital Information System (HIS) of the Sichuan Academy of Medical Sciences and Sichuan Provincial People's Hospital for analysis. The dataset was split into a training set and a test set in an 8:2 ratio. To process the data and establish risk warning models, four imputation methods, four balancing methods, three feature screening methods, and eighteen machine learning algorithms were employed. A 10-fold cross-validation technique was applied to internally validate the training set, while the bootstrap method was used for external validation on the test set. The area under the receiver operating characteristic curve (AUC) and decision curve analysis (DCA) were used to evaluate model performance, and feature contributions were interpreted using the SHapley Additive exPlanations (SHAP) approach. A web-based prediction platform for UTIs in T2DM was built with the Flask framework.

    In total, 106 variables were identified for analysis from 119 literature sources, and 1,340 patients were included in the study. After comprehensive data preprocessing, 48 datasets were generated, and 864 risk warning models were constructed from the combinations of balancing methods, feature selection techniques, and machine learning algorithms. Receiver operating characteristic (ROC) curves were used to assess these models, and the best model achieved an AUC of 0.9789 on external validation. Notably, the most important factors contributing to UTIs in T2DM patients were UTI-related inflammatory markers; medication use, mainly SGLT2 inhibitors; severity of comorbidities; blood routine indicators; and other factors such as length of hospital stay and estimated glomerular filtration rate (eGFR).

    The SHAP method was used to interpret the contribution of each feature to the model, and a user-friendly prediction platform was built on the optimal model to assist clinicians in making clinical decisions. The machine learning model-based prediction system developed in this study exhibited favorable predictive ability and promising clinical utility. The web-based prediction platform, combined with the professional judgment of clinicians, can help produce better clinical decisions.
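    As a minimal sketch of the 10-fold cross-validated AUC step only (not the authors' full pipeline of imputation, balancing, and feature screening), the snippet below scores a plain logistic regression on a synthetic stand-in cohort, since the hospital data is not public.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in cohort; logistic regression substitutes for the
# eighteen algorithms compared in the study.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 10-fold cross-validated AUC, mirroring the internal-validation setup.
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=10, scoring="roc_auc").mean()
```

    The same `cross_val_score` call generalizes to any of the other estimators by swapping the model argument; external validation would instead score a held-out test set with bootstrap resampling.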

  16. Real Estate Price Prediction Data

    • figshare.com
    txt
    Updated Aug 8, 2024
    Cite
    Mohammad Shbool; Rand Al-Dmour; Bashar Al-Shboul; Nibal Albashabsheh; Najat Almasarwah (2024). Real Estate Price Prediction Data [Dataset]. http://doi.org/10.6084/m9.figshare.26517325.v1
    Explore at:
    txt
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mohammad Shbool; Rand Al-Dmour; Bashar Al-Shboul; Nibal Albashabsheh; Najat Almasarwah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This dataset was collected and curated to support research on predicting real estate prices using machine learning algorithms, specifically Support Vector Regression (SVR) and Gradient Boosting Machine (GBM). The dataset includes comprehensive information on residential properties, enabling the development and evaluation of predictive models for accurate and transparent real estate appraisals.

    Data Source: The data was sourced from Department of Lands and Survey real estate listings.

    Features: The dataset contains the following key attributes for each property:

    Area (in square meters): The total living area of the property.
    Floor Number: The floor on which the property is located.
    Location: Geographic coordinates or city/region where the property is situated.
    Type of Apartment: The classification of the property, such as studio, one-bedroom, two-bedroom, etc.
    Number of Bathrooms: The total number of bathrooms in the property.
    Number of Bedrooms: The total number of bedrooms in the property.
    Property Age (in years): The number of years since the property was constructed.
    Property Condition: A categorical variable indicating the condition of the property (e.g., new, good, fair, needs renovation).
    Proximity to Amenities: The distance to nearby amenities such as schools, hospitals, shopping centers, and public transportation.
    Market Price (target variable): The actual sale price or listed price of the property.

    Data Preprocessing:

    Normalization: Numeric features such as area and proximity to amenities were normalized to ensure consistency and improve model performance.
    Categorical Encoding: Categorical features like property condition and type of apartment were encoded using one-hot encoding or label encoding, depending on the specific model requirements.
    Missing Values: Missing data points were handled using appropriate imputation techniques or by excluding records with significant missing information.

    Usage: This dataset was utilized to train and test machine learning models, aiming to predict the market price of residential properties based on the provided attributes. The models developed using this dataset demonstrated improved accuracy and transparency over traditional appraisal methods.

    Dataset Availability: The dataset is available for public use under CC BY 4.0. Users are encouraged to cite the related publication when using the data in their research or applications.

    Citation: If you use this dataset in your research, please cite the following publication: "Real Estate Decision-Making: Precision in Price Prediction through Advanced Machine Learning Algorithms."
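    The preprocessing and modeling steps described above can be sketched as a single scikit-learn pipeline. The tiny frame below is an invented stand-in that reuses the listed column names, with a GBM standing in for the full SVR/GBM comparison.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tiny illustrative frame; column names follow the description, values invented.
X = pd.DataFrame({
    "Area": [80, 120, 95, 150],
    "Property Age": [5, 20, 10, 1],
    "Property Condition": ["new", "fair", "good", "new"],
})
y = [90_000, 110_000, 95_000, 160_000]

# Normalize numeric features and one-hot encode categoricals, then fit a GBM.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", MinMaxScaler(), ["Area", "Property Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Property Condition"]),
    ])),
    ("gbm", GradientBoostingRegressor(random_state=0)),
])
model.fit(X, y)
preds = model.predict(X)
```

    Bundling the preprocessing into the pipeline ensures the scaler and encoder are fit on training data only whenever the pipeline is cross-validated or scored on a held-out set.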
