16 datasets found
  1. Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    ZANG, LEIZHEN; Feng XIONG (2023). Replication Data for: Comparative investigation of time series missing data imputation in political science: Different methods, different results [Dataset]. http://doi.org/10.7910/DVN/GQHURF
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    ZANG, LEIZHEN; Feng XIONG
    Description

    Missing data are a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data, using Internet and public service data as test examples. The empirical results show that the method not only verified the robustness of the positive impact of Internet penetration on public services, but also showed that machine-learning imputation outperformed random and multiple imputation, greatly improving the model's explanatory power. Because the panel data imputed by machine learning have better continuity in the time trend, they can also be analyzed with a dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms underlying the empirical results are discussed.

  2. New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Nov 14, 2025
    + more versions
    Cite
    National Institute of Justice (2025). New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 [Dataset]. https://catalog.data.gov/dataset/new-approach-to-evaluating-supplementary-homicide-report-shr-data-imputation-1990-1995-ff769
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    National Institute of Justice
    Description

    The purpose of the project was to learn more about patterns of homicide in the United States by strengthening the ability to make imputations for Supplementary Homicide Report (SHR) data with missing values. Supplementary Homicide Reports (SHR) and local police data from Chicago, Illinois, St. Louis, Missouri, Philadelphia, Pennsylvania, and Phoenix, Arizona, for 1990 to 1995 were merged to create a master file by linking on overlapping information on victim and incident characteristics. Through this process, 96 percent of the cases in the SHR were matched with cases in the police files. The data contain variables for three types of cases: complete in SHR, missing offender and incident information in SHR but known in police report, and missing offender and incident information in both. The merged file allows estimation of similarities and differences between the cases with known offender characteristics in the SHR and those in the other two categories. The accuracy of existing data imputation methods can be assessed by comparing imputed values in an "incomplete" dataset (the SHR), generated by the three imputation strategies discussed in the literature, with the actual values in a known "complete" dataset (combined SHR and police data). Variables from both the Supplemental Homicide Reports and the additional police report offense data include incident date, victim characteristics, offender characteristics, incident details, geographic information, as well as variables regarding the matching procedure.

  3. Retail Product Dataset with Missing Values

    • kaggle.com
    zip
    Updated Feb 17, 2025
    Cite
    Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
    Available download formats: zip (47826 bytes)
    Dataset updated
    Feb 17, 2025
    Authors
    Himel Sarder
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

    The dataset includes:
    - Category (Categorical): Product category (A, B, C, D)
    - Price (Numerical): Randomized product prices
    - Rating (Numerical): Ratings between 1 to 5
    - Stock (Categorical): Availability status (In Stock, Out of Stock)
    - Discount (Numerical): Discount percentage

    This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
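    As a starting point for the imputation practice this dataset targets, here is a minimal pandas sketch; the column names follow the description above, but the values are invented for illustration:

    ```python
    import pandas as pd

    # Toy frame with the documented columns; the values are made up.
    df = pd.DataFrame({
        "Category": ["A", None, "C", "A"],
        "Price": [10.0, None, 30.0, 20.0],
        "Rating": [4.0, 2.0, None, 5.0],
        "Stock": ["In Stock", "Out of Stock", None, "In Stock"],
        "Discount": [5.0, None, 15.0, 10.0],
    })

    # A simple baseline: median for numeric columns, mode for categorical ones.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    ```

    More sophisticated strategies (KNN or iterative imputation) can then be benchmarked against this baseline.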

  4. Descriptive statistics for original data and complete data after imputation in Experiment 2

    • plos.figshare.com
    xls
    Updated Nov 29, 2023
    Cite
    Maria Chiara Liverani; Eleni Kalogirou; Catherine Rivier; Edouard Gentaz (2023). Descriptive statistics for original data and complete data after imputation in Experiment 2. [Dataset]. http://doi.org/10.1371/journal.pone.0289027.t004
    Available download formats: xls
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Maria Chiara Liverani; Eleni Kalogirou; Catherine Rivier; Edouard Gentaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Descriptive statistics for original data and complete data after imputation in Experiment 2.

  5. Smartphones Dataset (August 2024)

    • kaggle.com
    zip
    Updated Aug 24, 2024
    Cite
    Dilkush Singh (2024). Smartphones Dataset (August 2024) [Dataset]. https://www.kaggle.com/datasets/dilkushsingh/smartphones-dataset-upto-july24
    Available download formats: zip (605033 bytes)
    Dataset updated
    Aug 24, 2024
    Authors
    Dilkush Singh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Smartphones Dataset (August 2024)

    This dataset contains information on the latest smartphones as of July 2024, gathered through web scraping using Selenium and Beautiful Soup. The dataset is available in four different versions, reflecting the stages of data cleaning and processing.
    - If you want to know about the web scraping process, read the blog: Medium Article
    - If you want to see the step-by-step process of data cleaning and EDA, check out my GitHub repo: GitHub Repo

    Dataset Versions:

    Version 1: Raw Data (smartphones.csv or smartphones_uncleaned.csv - same files)

    This version contains the fully uncleaned data as it was initially scraped from the web. It includes all the raw information, with inconsistencies, missing values, and potential duplicates. Purpose: Serves as the baseline dataset for understanding the initial state of the data before any cleaning or processing.

    Version 2: Basic Cleaning (smartphones_cleaned_v1.csv)

    Basic cleaning operations have been applied. This includes removing duplicates, handling missing values, and standardizing the formats of certain fields (e.g., dates, numerical values). Purpose: Provides a cleaner and more consistent dataset, making it easier for basic analysis.

    Version 3: Intermediate Cleaning (smartphones_cleaned_v2.csv)

    Additional data cleaning techniques have been implemented. This version addresses more complex issues such as outlier detection and correction, normalization of categorical data, and initial feature engineering. Purpose: Offers a more refined dataset suitable for exploratory data analysis (EDA) and more in-depth statistical analyses.

    Version 4: Fully Cleaned and Processed Data (smartphones_cleaned_v3.csv)

    This version represents the final, fully cleaned dataset. Advanced cleaning techniques have been applied, including imputation of missing data, removal of irrelevant features, and final feature engineering. Purpose: Ideal for machine learning model training and other advanced analytics.
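    The progression from raw scrape to basic cleaning described above can be sketched in pandas; the rows, the ₹-formatted price strings, and the column names here are invented for illustration, not taken from the actual files:

    ```python
    import pandas as pd

    # Hypothetical raw scrape: a duplicated row, price stored as text, a missing rating.
    raw = pd.DataFrame({
        "model": ["Phone A", "Phone A", "Phone B", "Phone C"],
        "price": ["₹19,999", "₹19,999", "₹24,999", "₹9,999"],
        "rating": ["4.5", "4.5", None, "4.1"],
    })

    clean = raw.drop_duplicates().copy()           # remove duplicates
    clean["price"] = (clean["price"]               # standardize numeric formats
                      .str.replace("₹", "", regex=False)
                      .str.replace(",", "", regex=False)
                      .astype(float))
    clean["rating"] = pd.to_numeric(clean["rating"], errors="coerce")
    clean["rating"] = clean["rating"].fillna(clean["rating"].median())  # handle missing
    ```

    Later versions would layer outlier handling and feature engineering on top of a pipeline like this.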

  6. Data from: Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment

    • tandf.figshare.com
    zip
    Updated Aug 29, 2025
    Cite
    Siyu Heng; Jiawei Zhang; Yang Feng (2025). Design-Based Causal Inference with Missing Outcomes: Missingness Mechanisms, Imputation-Assisted Randomization Tests, and Covariate Adjustment [Dataset]. http://doi.org/10.6084/m9.figshare.29356876.v1
    Available download formats: zip
    Dataset updated
    Aug 29, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Siyu Heng; Jiawei Zhang; Yang Feng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Design-based causal inference, also known as randomization-based or finite-population causal inference, is one of the most widely used causal inference frameworks, largely due to the merit that its validity can be guaranteed by study design (e.g., randomized experiments) and does not require assuming specific outcome-generating distributions or super-population models. Despite its advantages, design-based causal inference can still suffer from other issues, among which outcome missingness is a prevalent and significant challenge. This work systematically studies the outcome missingness problem in design-based causal inference. First, we propose a general and flexible outcome missingness mechanism that can facilitate finite-population-exact randomization tests of no treatment effect. Second, under this general missingness mechanism, we propose a general framework called “imputation and re-imputation” for conducting randomization tests in design-based causal inference with missing outcomes. We prove that our framework can still ensure finite-population-exact Type-I error rate control even when the imputation model was misspecified or when unobserved covariates or interference exist in the missingness mechanism. Third, we extend our framework to conduct covariate adjustment in randomization tests and construct finite-population-valid confidence regions with missing outcomes. Our framework is evaluated via extensive simulation studies and applied to a large-scale randomized experiment. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
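    A minimal numpy sketch of the "imputation and re-imputation" idea: impute missing outcomes, compute a test statistic, then for each re-randomized treatment assignment re-impute before recomputing. Arm-mean imputation and a difference in means stand in for the paper's general imputation model and test statistic, and the data are simulated under the null; this is an illustration of the mechanics, not the authors' implementation:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy finite population: binary treatment Z, outcomes Y with missingness (NaN).
    n = 40
    Z = np.repeat([1, 0], n // 2)
    Y = rng.normal(0, 1, n)                             # no treatment effect
    Y[rng.choice(n, size=8, replace=False)] = np.nan    # outcome missingness

    def impute_then_stat(Y, Z):
        """Impute missing outcomes by the observed arm mean, then return the
        difference in arm means (a simple stand-in for the general framework)."""
        Yi = Y.copy()
        for z in (0, 1):
            arm = Z == z
            Yi[arm & np.isnan(Yi)] = np.nanmean(Yi[arm])
        return Yi[Z == 1].mean() - Yi[Z == 0].mean()

    obs = impute_then_stat(Y, Z)
    # Re-randomize treatment and RE-impute each time ("imputation and re-imputation").
    perm = np.array([impute_then_stat(Y, rng.permutation(Z)) for _ in range(999)])
    p_value = (1 + np.sum(np.abs(perm) >= np.abs(obs))) / (1 + len(perm))
    ```

    Because the imputation step depends on the treatment assignment, it must be rerun inside the permutation loop; that is the point of re-imputation.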

  7. A Hybrid Educational Dataset

    • kaggle.com
    Updated Jun 27, 2025
    Cite
    Emanoel Carvalho Lopes (2025). A Hybrid Educational Dataset [Dataset]. https://www.kaggle.com/datasets/emanoelcarvalholopes/uci-oulad-sintetico-unificados
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Emanoel Carvalho Lopes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    The early identification of students facing learning difficulties is one of the most critical challenges in modern education. Intervening effectively requires leveraging data to understand the complex interplay between student demographics, engagement patterns, and academic performance.

    This dataset was created to serve as a high-quality, pre-processed resource for building machine learning models to tackle this very problem. It is a unique hybrid dataset, meticulously crafted by unifying three distinct sources:

    The Open University Learning Analytics Dataset (OULAD): A rich dataset detailing student interactions with a Virtual Learning Environment (VLE). We have aggregated the raw, granular data (over 10 million interaction logs) into powerful features, such as total clicks, average assessment scores, and distinct days of activity for each student registration.

    The UCI Student Performance Dataset: A classic educational dataset containing demographic information and final grades in Portuguese and Math subjects from two Portuguese schools.

    A Synthetic Data Component: A synthetically generated portion of the data, created to balance the dataset or represent specific student profiles.

    Data Unification and Pre-processing

    A direct merge of these sources was not possible as the student identifiers were not shared. Instead, a strategy of intelligent concatenation was employed. The final dataset has undergone a rigorous pre-processing pipeline to make it immediately usable for machine learning tasks:

    • Advanced Imputation: Missing values were handled using a sophisticated iterative imputation method powered by Gaussian Mixture Models (GMM), ensuring the dataset's integrity.

    • One-Hot Encoding: All categorical features have been converted to a numerical format.

    • Feature Scaling: All numerical features have been standardized (using StandardScaler) to have a mean of 0 and a standard deviation of 1, preventing model bias from features with different scales.

    The result is a clean, comprehensive dataset ready for modeling.
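    scikit-learn does not ship a GMM-powered iterative imputer, but its MICE-style IterativeImputer followed by StandardScaler gives a rough sketch of the imputation-then-scaling steps described above (the matrix here is a toy with invented values, not the actual dataset):

    ```python
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.preprocessing import StandardScaler

    # Toy numeric matrix with missing entries.
    X = np.array([[1.0, 2.0],
                  [2.0, np.nan],
                  [3.0, 6.0],
                  [np.nan, 8.0]])

    # Iteratively impute each feature from the others, then standardize
    # to mean 0 and standard deviation 1, as in the pipeline described.
    X_imp = IterativeImputer(random_state=0).fit_transform(X)
    X_std = StandardScaler().fit_transform(X_imp)
    ```

    One-hot encoding of categoricals (e.g., via pandas.get_dummies) would happen before the scaling step, as in the published pipeline.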

    File Information

    Instance

    Each row represents a student profile, and the columns are the features and the target.

    Feature

    Features include aggregated online engagement metrics (e.g., clicks, distinct activities), academic performance (grades, scores), and student demographics (e.g., gender, age band). A key feature indicates the original data source (OULAD, UCI, Synthetic).

    Sensitive Information

    The dataset contains no Personally Identifiable Information (PII). Demographic information is presented in broad, anonymized categories.

    Key Columns:

    Target Variable:
    
      had_difficulty: The primary target for classification. This binary variable has been engineered from the original final_result column of the OULAD dataset.
    
        1: The student either failed (Fail) or withdrew (Withdrawn) from the course.
    
        0: The student passed (Pass or Distinction).
    
    Feature Groups:
    
      OULAD Aggregated Features (e.g., oulad_total_cliques, oulad_media_notas): Quantitative metrics summarizing a student's engagement and performance within the VLE.
    
      Academic Performance Features (e.g., nota_matematica_harmonizada): Harmonized grades from different data sources.
    
      Demographic Features (e.g., gender_*, age_band_*): One-hot encoded columns representing student demographics.
    
      Origin Features (e.g., origem_dado_OULAD, origem_dado_UCI): One-hot encoded columns indicating the original source of the data for each row. This allows for source-specific analysis.
    

    (Note: All numerical feature names are post-scaling and may not directly reflect their original names. Please refer to the complete column list for details.)

    Acknowledgements

    This dataset would not be possible without the original data providers. Please acknowledge them in any work that uses this data:

    OULAD Dataset: Kuzilek, J., Hlosta, M., and Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4. https://analyse.kmi.open.ac.uk/open_dataset
    
    UCI Student Performance Dataset: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS. https://archive.ics.uci.edu/ml/datasets/student+performance
    

    Inspiration

    This dataset is perfect for a variety of predictive modeling tasks. Here are a few ideas to get you started:

    • Can you build a classification model to predict had_difficulty with high recall? (Minimizing the number of at-risk students we fail to identify.)
    
    • Which features are the most powerful predictors of student failure or withdrawal? (Feature Importance Analysis).

    • Can you build separate models for each data origin (origem_dado_*) and compare ...

  8. Factors associated with each domain of burnout among resident physicians (multiple imputation, n = 296)

    • plos.figshare.com
    xls
    Updated Oct 30, 2024
    Cite
    Vithawat Surawattanasakul; Penprapa Siviroj; Wuttipat Kiratipaisarl (2024). Factors associated with each domain of burnout among resident physicians (multiple imputation, n = 296). [Dataset]. http://doi.org/10.1371/journal.pone.0312839.t003
    Available download formats: xls
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Vithawat Surawattanasakul; Penprapa Siviroj; Wuttipat Kiratipaisarl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Factors associated with each domain of burnout among resident physicians (multiple imputation, n = 296).

  9. 📊 Telco Customer Churn Dataset

    • kaggle.com
    zip
    Updated Jul 18, 2025
    Cite
    Austin Kleon (2025). 📊 Telco Customer Churn Dataset [Dataset]. https://www.kaggle.com/datasets/jethwaaatmik/telco-customer-churn-dataset
    Available download formats: zip (172687 bytes)
    Dataset updated
    Jul 18, 2025
    Authors
    Austin Kleon
    Description

    📝 Dataset Description This dataset contains information about customers of a telecommunications company, including their demographic details, account information, service subscriptions, and churn status. It is a modified version of the popular Telco Churn dataset, curated for exploratory data analysis, machine learning model development, and churn prediction tasks.

    The dataset includes simulated missing values in some columns to reflect real-world data issues and support preprocessing and imputation tasks. This makes it especially useful for demonstrating data cleaning techniques and evaluating model robustness.

    📂 Files Included telco_data_modified.csv: The main dataset with 21 columns and 7043 rows (some missing values are intentionally inserted).

    📌 Features
    - customerID: Unique identifier for each customer
    - gender: Customer gender (Male/Female)
    - SeniorCitizen: Indicates if the customer is a senior citizen (0 = No, 1 = Yes)
    - Partner: Whether the customer has a partner
    - Dependents: Whether the customer has dependents
    - tenure: Number of months the customer has stayed with the company
    - PhoneService: Whether the customer has phone service
    - MultipleLines: Whether the customer has multiple lines
    - InternetService: Customer's internet service provider (DSL, Fiber optic, No)
    - OnlineSecurity: Whether the customer has online security
    - OnlineBackup: Whether the customer has online backup
    - DeviceProtection: Whether the customer has device protection
    - TechSupport: Whether the customer has tech support
    - StreamingTV: Whether the customer has streaming TV
    - StreamingMovies: Whether the customer has streaming movies
    - Contract: Type of contract (Month-to-month, One year, Two year)
    - PaperlessBilling: Whether the customer uses paperless billing
    - PaymentMethod: Payment method (e.g., Electronic check, Mailed check)
    - MonthlyCharges: Monthly charges
    - TotalCharges: Total charges to date
    - Churn: Whether the customer has left the company (Yes/No)

    🔍 Use Cases Binary classification: Predict customer churn

    Data preprocessing and imputation exercises

    Feature engineering and importance analysis

    Customer segmentation and churn modeling

    ⚠️ Notes Missing values were intentionally inserted in the dataset to help simulate real-world conditions.

    Some preprocessing may be required before modeling (e.g., converting categorical to numerical data, handling TotalCharges as numeric).
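    For example, handling TotalCharges as numeric might look like this in pandas; the blank-string rows and the fill-with-zero choice are illustrative assumptions (in the original Telco data, blanks correspond to tenure-0 customers):

    ```python
    import pandas as pd

    # The raw file stores TotalCharges as text, with blanks for new customers.
    df = pd.DataFrame({"tenure": [0, 12, 24],
                       "TotalCharges": [" ", "840.5", "1690.0"]})

    # Coerce to numeric: unparseable blanks become NaN.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # One common choice: a blank TotalCharges at tenure 0 really means 0.
    df["TotalCharges"] = df["TotalCharges"].fillna(0.0)
    ```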

    🏷️ Tags

    #telecom #churn #classification #customer-analytics #data-cleaning #feature-engineering

    🙏 Acknowledgements This dataset is based on the original Telco Customer Churn dataset (initially provided by IBM). The current version has been modified for academic and practical exercises.

  10. Supplementary file 2_Machine learning enables early risk stratification of hymenopteran stings: evidence from a tropical multicenter cohort

    • figshare.com
    xlsx
    Updated Oct 28, 2025
    + more versions
    Cite
    Feng Han; Yuanshui Liu; Huamei Li; Xiaofang Chen; Liqiu Liang; Dongchuan Xu; Lijiao Ye; Yanhong Ouyang; Ping He; Wang Liao (2025). Supplementary file 2_Machine learning enables early risk stratification of hymenopteran stings: evidence from a tropical multicenter cohort.xlsx [Dataset]. http://doi.org/10.3389/fpubh.2025.1664606.s004
    Available download formats: xlsx
    Dataset updated
    Oct 28, 2025
    Dataset provided by
    Frontiers
    Authors
    Feng Han; Yuanshui Liu; Huamei Li; Xiaofang Chen; Liqiu Liang; Dongchuan Xu; Lijiao Ye; Yanhong Ouyang; Ping He; Wang Liao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Hymenopteran stings (from bees, wasps, and hornets) can trigger severe systemic reactions, especially in tropical regions, risking patient safety and emergency care efficiency. Accurate early risk stratification is essential to guide timely intervention.

    Objective: To develop and validate an interpretable machine learning model for early prediction of severe outcomes following hymenopteran stings.

    Methods: We retrospectively analyzed 942 cases from a multicenter cohort in Hainan Province, China. Questionnaires with >20% missing data were excluded. Mean substitution was applied for primary missing data imputation, with multiple imputation by chained equations (MICE) used for sensitivity analysis. Seven supervised classifiers were trained using five-fold cross-validation; class imbalance was addressed using the adaptive synthetic sampling (ADASYN) algorithm. Model performance was evaluated via area under the receiver operating characteristic curve (AUC), recall, and precision, and feature importance was interpreted using Shapley additive explanations (SHAP) values.

    Results: Among 942 patients, 8.7% developed severe systemic complications. The distribution by species was: wasps (25.5%), honey bees (8.9%), and unknown species (65.6%). The optimal Extra Trees model achieved an AUC of 0.982, recall of 0.956, and precision of 0.926 in the held-out validation set. Key predictors included hypotension, dyspnea, altered mental status, elevated leukocyte counts, and abnormal creatinine levels. A web-based risk calculator was deployed for bedside application. Given the small number of high-risk cases, these high AUC values may overestimate real-world performance and require external validation.

    Conclusion: We developed an interpretable, deployable tool for early triage of hymenopteran sting patients in tropical settings. Emergency integration may improve clinical decisions and outcomes.
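    A small pandas sketch of the primary/sensitivity imputation contrast in the Methods, using invented lab values and median substitution as a stand-in for MICE (which needs a dedicated implementation such as scikit-learn's IterativeImputer):

    ```python
    import numpy as np
    import pandas as pd

    # Toy lab values with missingness; the real cohort had 942 patients.
    df = pd.DataFrame({"wbc": [6.2, np.nan, 11.5, 9.8, np.nan, 7.1],
                       "creatinine": [0.9, 1.4, np.nan, 2.1, 1.0, np.nan]})

    # Primary analysis: mean substitution, as described above.
    primary = df.fillna(df.mean())

    # Sensitivity stand-in: a second imputation strategy (median here, MICE in the paper).
    sensitivity = df.fillna(df.median())

    # Check how much downstream summaries move between the two imputations.
    shift = (primary.mean() - sensitivity.mean()).abs()
    ```

    If the downstream estimates barely move between the two imputations, the conclusions are robust to the imputation choice; that is the point of the sensitivity analysis.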

  11. US traffic data with weather and calendar dataset

    • kaggle.com
    zip
    Updated Aug 25, 2023
    Cite
    Maryam Shoaei (2023). US traffic data with weather and calendar dataset [Dataset]. https://www.kaggle.com/datasets/maryamshoaei/us-traffic-data-with-weather-and-calendar-dataset/data
    Available download formats: zip (367139 bytes)
    Dataset updated
    Aug 25, 2023
    Authors
    Maryam Shoaei
    Area covered
    United States
    Description

    This dataset was collected for a research paper titled "Twitter-informed Prediction for Urban Traffic Flow Using Machine Learning," which is available online at https://ieeexplore.ieee.org/document/10185516. If you intend to use this dataset, we kindly request that you consider acknowledging our paper by including a citation. Your support in referencing our work would be greatly appreciated.

    The traffic dataset was obtained through the California Performance Measurement System (PeMS) in the United States. It encompasses traffic data, including speed and flow information, for the eastbound lanes of the Ventura Highway in Los Angeles, covering the period from February 1 to May 31, 2020.

    Calendar features in this dataset consist of weekdays, represented as numbers from 1 to 7, and a binary variable indicating whether a specific day is a holiday. Weather data was sourced from the Wunderground website (accessible at https://www.wunderground.com/history/daily/KLAX) throughout the study period. Weather data includes hourly observations of various meteorological factors. For consistency, we assume that weather conditions remain constant during each 5-minute time interval within an hour.

    Weather conditions in the dataset include categories such as fair, blowing dust, cloudy, cloudy/windy, fair/windy, fog, haze, heavy rain, light rain, mostly cloudy, mostly cloudy/windy, partly cloudy/windy, rain, and thunder in the vicinity. Temperature is measured in Fahrenheit.

    Missing data in this context refers to temporary disruptions in the availability of traffic information within specific areas of the transportation network due to sensor failures or noisy data. To address these missing values, we employed the mean imputation method.
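    A minimal pandas sketch of mean imputation on a 5-minute flow series (the values are invented), alongside linear interpolation as a time-aware alternative one might compare against:

    ```python
    import numpy as np
    import pandas as pd

    # 5-minute traffic flow with sensor dropouts (NaN), as in the PeMS data.
    flow = pd.Series([310.0, 295.0, np.nan, np.nan, 280.0, 300.0])

    mean_filled = flow.fillna(flow.mean())   # the approach used in the paper
    interp_filled = flow.interpolate()       # a time-aware alternative
    ```

    Mean imputation fills every gap with one global value, while interpolation respects the local time trend; for short sensor outages the two can differ noticeably.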

  12. MedSec-25: IoMT Cybersecurity Dataset

    • kaggle.com
    zip
    Updated Sep 8, 2025
    Cite
    Muhammad Abdullah (2025). MedSec-25: IoMT Cybersecurity Dataset [Dataset]. https://www.kaggle.com/datasets/abdullah001234/medsec-25-iomt-cybersecurity-dataset
    Available download formats: zip (38496221 bytes)
    Dataset updated
    Sep 8, 2025
    Authors
    Muhammad Abdullah
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Overview

    MedSec-25 is a comprehensive, labeled network traffic dataset designed specifically for the Internet of Medical Things (IoMT) in healthcare environments. It addresses the limitations of existing generic IoT datasets by capturing realistic traffic from a custom-built healthcare IoT lab that mimics real-world hospital operations. The dataset includes both benign (normal) traffic and malicious traffic from multi-staged attack campaigns inspired by the MITRE ATT&CK framework. This allows for the development and evaluation of machine learning-based intrusion detection systems (IDS) tailored to IoMT scenarios, where patient safety and data privacy are critical. The dataset was generated using a variety of medical sensors (e.g., ECG, EEG, HHI, Respiration, SpO2) and environmental sensors (e.g., thermistor, ultrasonic, PIR, flame) connected via Raspberry Pi nodes and an IoT server. Traffic was captured over 7.5 hours using tools like Wireshark and tcpdump, resulting in PCAPNG files. These were processed with CICFlowMeter to extract flow-based features, producing a cleaned CSV dataset with 554,534 bidirectional network flows and 84 features.

    Key Highlights:

    Realistic Setup: Built in a physical lab at Rochester Institute of Technology, Dubai, incorporating diverse IoMT devices, protocols (e.g., MQTT, SSH, Telnet, FTP, HTTP, DNS), and real-time patient interactions (anonymized to comply with privacy regulations like HIPAA).

    Multi-Staged Attacks: Unlike datasets focusing on isolated attacks, MedSec-25 simulates full attack chains: Reconnaissance (e.g., SYN/TCP scans, OS fingerprinting), Initial Access (e.g., brute-force, malformed MQTT packets), Lateral Movement (e.g., exploiting vulnerabilities to pivot between devices), and Exfiltration (e.g., data theft via MQTT).

    Imbalanced Nature: This is the cleaned (imbalanced) version of the dataset. Users may need to apply balancing techniques (e.g., SMOTE oversampling + random undersampling) for model training, as demonstrated in the associated paper.

    Size and Quality: 554,534 rows, no duplicates, and almost no missing values (only 111 NaNs in Flow Byts/s, ~0.02%, which can be handled via imputation). Data types include float64 (45 columns), int64 (34 columns), and object (5 columns: Flow ID, Src IP, Dst IP, Timestamp, Label).

    Utility: Preliminary models trained on this dataset (e.g., KNN: 98.09% accuracy, Decision Tree: 98.35% accuracy) show excellent performance for detecting attack stages.

    This dataset is ideal for researchers in cybersecurity, machine learning, and healthcare IoT, enabling the creation of an IDS that can detect attacks at different phases to prevent escalation.

    Data Collection

    Benign Traffic: Generated over two days with active sensors, services (HTTP dashboard for patient monitoring, SSH/Telnet for remote access, FTP for file transfers), and real users (students/faculty) interacting with medical devices. No personally identifiable information was stored.

    Malicious Traffic: Two Kali Linux attacker machines simulated MITRE ATT&CK-inspired campaigns using tools like Nmap, Scapy, Metasploit, and custom Python scripts.

    Capture Tools: Wireshark and tcpdump for PCAPNG files (total ~1GB: 600MB benign, 400MB malicious).

    Processing: Combined PCAP files per label, extracted features with CICFlowMeter, labeled flows manually based on attack phases, and cleaned for ML readiness. The final cleaned CSV is ~350MB.

    Features

    The dataset includes 84 features extracted by CICFlowMeter, categorized as:

    Identifiers: Flow ID, Src IP, Src Port, Dst IP, Dst Port, Protocol, Timestamp.

    Time-Series Metrics: Flow Duration, Flow IAT Mean/Std/Max/Min, Fwd/Bwd IAT Tot/Mean/Std/Max/Min.

    Size/Count Statistics: Tot Fwd/Bwd Pkts, TotLen Fwd/Bwd Pkts, Fwd/Bwd Pkt Len Max/Min/Mean/Std, Pkt Len Min/Max/Mean/Std/Var, Pkt Size Avg.

    Flag Counts: Fwd/Bwd PSH/URG Flags, FIN/SYN/RST/PSH/ACK/URG/CWE/ECE Flag Cnt.

    Rates and Ratios: Flow Byts/s, Flow Pkts/s, Fwd/Bwd Pkts/s, Down/Up Ratio, Active/Idle Mean/Std/Max/Min.

    Segmentation and Others: Fwd/Bwd Seg Size Avg/Min, Subflow Fwd/Bwd Pkts/Byts, Init Fwd/Bwd Win Byts, Fwd Act Data Pkts, Fwd/Bwd Byts/b Avg, Fwd/Bwd Pkts/b Avg, Fwd/Bwd Blk Rate Avg.

    Labels

    The dataset is labeled with 5 classes representing benign behavior and attack stages:

    - Reconnaissance: 401,683 flows
    - Initial Access: 102,090 flows
    - Exfiltration: 25,915 flows
    - Lateral Movement: 12,498 flows
    - Benign: 12,348 flows

    Note: The dataset is imbalanced, with Reconnaissance dominating. Apply balancing techniques for optimal ML performance.

    Usage

    Preprocessing Suggestions: Encode categorical features (e.g., Protocol, Label) using LabelEncoder. Normalize numerical features with Min-Max Scaler or StandardScaler. Handle the minor NaNs in Flow Byts/s via mean imputation.
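    The suggestions above can be sketched on a toy frame as follows; the column names (Protocol, Flow Byts/s, Label) follow the dataset schema, while the three rows of values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Three toy flows standing in for the cleaned CSV; values are illustrative.
df = pd.DataFrame({
    "Protocol": ["TCP", "UDP", "TCP"],
    "Flow Byts/s": [1200.0, np.nan, 800.0],
    "Label": ["Benign", "Reconnaissance", "Benign"],
})

# Mean-impute the sparse NaNs in Flow Byts/s.
df["Flow Byts/s"] = df["Flow Byts/s"].fillna(df["Flow Byts/s"].mean())

# Encode the categorical columns, then scale numerics to [0, 1].
for col in ("Protocol", "Label"):
    df[col] = LabelEncoder().fit_transform(df[col])
df[["Flow Byts/s"]] = MinMaxScaler().fit_transform(df[["Flow Byts/s"]])
```

    In a real split, fit the scaler on the training partition only and reuse it on the test partition to avoid leakage.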

    Model Training: Split into train/test (e.g., 80/20). Suitable for classification tasks w...

  13. Hotel Reviews Dataset

    • kaggle.com
    zip
    Updated Aug 27, 2024
    Cite
    Waseem AlAstal (2024). Hotel Reviews Dataset [Dataset]. https://www.kaggle.com/datasets/waseemalastal/hotel-reviews-dataset
    Explore at:
    zip (3410051 bytes)
    Dataset updated
    Aug 27, 2024
    Authors
    Waseem AlAstal
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context and Sources: This dataset comprises 10,000 hotel reviews collected from various online sources, including platforms like Hotels.com and TripAdvisor. Each entry contains detailed information about the review, the reviewer, and the hotel, providing valuable insights into customer satisfaction and preferences.

    Inspiration: This dataset was created to facilitate the analysis of customer reviews in the hospitality industry. It can be used to study customer sentiments, identify trends, and improve hotel services by understanding the key factors that contribute to customer satisfaction.

    Use Cases:

    Sentiment Analysis: Analyze the sentiment of reviews to determine customer satisfaction and identify areas for improvement.
    Trend Analysis: Identify common themes and trends in customer feedback over time.
    Recommender Systems: Use the data to build systems that suggest hotels based on user preferences and review patterns.
    Market Research: Understand customer preferences and competitive positioning within the hotel industry.

    Dataset Overview:

    Number of Rows: 10,000
    Number of Columns: 25

    Key Columns:

    reviews.text: The text of the review, offering qualitative insights into customer experiences.
    reviews.rating: The rating given by the reviewer, typically on a scale from 1 to 5.
    city, country: Geographical location of the hotel, enabling region-specific analysis.
    reviews.username: The username of the reviewer, which can be used to study review patterns and behaviors.
    reviews.date: The date the review was written, useful for temporal analysis.

    Potential Challenges:

    Missing Data: Some columns, such as reviews.userCity and reviews.userProvince, have missing values, which may require imputation or exclusion during analysis.
    Data Imbalance: The distribution of ratings might be skewed, which could affect sentiment analysis or other predictive modeling tasks.

    This dataset is well-suited for various applications in natural language processing, machine learning, and data analysis within the hospitality industry.
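    As a quick illustration of sizing up those challenges before modeling, the sketch below measures missingness and rating skew on an invented toy frame that borrows the dataset's column names.

```python
import pandas as pd

# Toy stand-in for the reviews file; column names follow the dataset schema,
# the six rows of values are invented.
reviews = pd.DataFrame({
    "reviews.rating": [5, 4, 5, 1, 5, 3],
    "reviews.userCity": ["Rome", None, None, "Oslo", None, "Kyiv"],
})

# Quantify missingness before choosing between imputation and exclusion.
missing_share = reviews["reviews.userCity"].isna().mean()

# Check how skewed the rating distribution is before sentiment modeling.
top_rating_share = (reviews["reviews.rating"] == 5).mean()
```

    A high missing share argues for dropping the column outright; a dominant top rating argues for class weighting or resampling in any downstream classifier.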

  14. Customer_Financial_Data

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    Prashob Narendran (2025). Customer_Financial_Data [Dataset]. https://www.kaggle.com/datasets/prashobnarendran/customer-financial-data
    Explore at:
    zip (62099 bytes)
    Dataset updated
    Nov 12, 2025
    Authors
    Prashob Narendran
    Description

    Context This dataset contains detailed, anonymized information about a bank's customers. It includes demographic data such as age, income, and family size, as well as financial information like mortgage value, credit card ownership, and average spending habits. The data is well-suited for a variety of machine learning tasks, particularly in the domain of financial services and marketing.

    Content The dataset consists of 5000 customer records with 14 attributes:

    • Customer_ID: A unique identifier for each customer.
    • Age: The customer's age in completed years.
    • Years_Experience: Years of professional experience.
    • Annual_Income: Annual income of the customer (in thousands of dollars).
    • ZIP_Code: The customer's home address ZIP code.
    • Family_size: The number of individuals in the customer's family.
    • Avg_Spending: Average monthly spending on credit cards (in thousands of dollars).
    • Education_Level: A categorical variable for education level (1: Undergraduate, 2: Graduate, 3: Advanced/Professional).
    • Mortgage: The value of the customer's house mortgage if any (in thousands of dollars).
    • Has_Consumer_Loan: Binary variable indicating if the customer accepted a personal loan in the last campaign (1: Yes, 0: No). This is a potential target variable.
    • Has_Securities_Account: Binary variable indicating if the customer has a securities account with the bank.
    • Has_CD_Account: Binary variable indicating if the customer has a certificate of deposit (CD) account with the bank.
    • Uses_Online_Banking: Binary variable indicating if the customer uses online banking services.
    • Has_CreditCard: Binary variable indicating if the customer uses a credit card issued by this bank.

    Data Quality Note Some rows contain negative values in the Years_Experience column. This is a data quality issue that may require preprocessing, e.g., taking the absolute value, or imputing with the average experience of similar age groups.
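    Both suggested fixes can be sketched in pandas on a toy frame. The Age and Years_Experience columns follow the dataset; the four rows of values and the helper column names (Exp_abs, Exp_imputed) are invented for illustration.

```python
import pandas as pd

# Toy records mirroring the described quality issue.
df = pd.DataFrame({
    "Age": [25, 30, 30, 45],
    "Years_Experience": [2, -3, 5, 20],
})

# Option 1: treat negatives as sign errors and take the absolute value.
df["Exp_abs"] = df["Years_Experience"].abs()

# Option 2: treat negatives as invalid, then impute the mean of the
# remaining valid values within the same age group.
valid = df["Years_Experience"].mask(df["Years_Experience"] < 0)
df["Exp_imputed"] = valid.fillna(valid.groupby(df["Age"]).transform("mean"))
```

    Option 2 is the safer default when the negative sign could mean anything (entry error, sentinel value) rather than a simple sign flip.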

    Potential Use Cases This dataset is excellent for both educational and practical purposes. You can use it to:

    1. Predict Loan Acceptance: Build a classification model to predict which customers are most likely to accept a personal loan (Has_Consumer_Loan).
    2. Customer Segmentation: Use clustering algorithms (like K-Means) to identify distinct customer segments for targeted marketing campaigns.
    3. Credit Card Adoption: Analyze the factors that influence a customer's decision to get a bank-issued credit card.
    4. Exploratory Data Analysis (EDA): Practice your data analysis and visualization skills to uncover insights about customer behavior.
  15. Table1_A personalized prediction model for urinary tract infections in type...

    • frontiersin.figshare.com
    docx
    Updated Jan 5, 2024
    Cite
    Yu Xiong; Yu-Meng Liu; Jia-Qiang Hu; Bao-Qiang Zhu; Yuan-Kui Wei; Yan Yang; Xing-Wei Wu; En-Wu Long (2024). Table1_A personalized prediction model for urinary tract infections in type 2 diabetes mellitus using machine learning.DOCX [Dataset]. http://doi.org/10.3389/fphar.2023.1259596.s001
    Explore at:
    docx
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Yu Xiong; Yu-Meng Liu; Jia-Qiang Hu; Bao-Qiang Zhu; Yuan-Kui Wei; Yan Yang; Xing-Wei Wu; En-Wu Long
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Patients with type 2 diabetes mellitus (T2DM) are at higher risk for urinary tract infections (UTIs), which greatly impact their quality of life. Developing a risk prediction model to identify patients at high risk of UTIs among those with T2DM, and using it to assist clinical decision-making, can help reduce the incidence of UTIs in T2DM patients.

    To construct the predictive model, potentially relevant variables were first selected from the reference literature, and data were then extracted from the Hospital Information System (HIS) of the Sichuan Academy of Medical Sciences and Sichuan Provincial People's Hospital for analysis. The dataset was split into a training set and a test set in an 8:2 ratio. To process the data and establish risk warning models, four imputation methods, four balancing methods, three feature screening methods, and eighteen machine learning algorithms were employed. A 10-fold cross-validation technique was applied to internally validate the training set, while the bootstrap method was used for external validation on the test set. The area under the receiver operating characteristic curve (AUC) and decision curve analysis (DCA) were used to evaluate model performance, and feature contributions were interpreted using the SHapley Additive exPlanations (SHAP) approach. A web-based prediction platform for UTIs in T2DM was built with the Flask framework.

    In total, 106 variables were identified for analysis from 119 literature sources, and 1,340 patients were included in the study. After comprehensive data preprocessing, 48 datasets were generated, and 864 risk warning models were constructed from the combinations of balancing methods, feature selection techniques, and machine learning algorithms. Receiver operating characteristic (ROC) curves were used to assess these models, and the best model achieved an AUC of 0.9789 on external validation. Notably, the most important factors contributing to UTIs in T2DM patients were UTI-related inflammatory markers; medication use, mainly SGLT2 inhibitors; severity of comorbidities; blood routine indicators; and other factors such as length of hospital stay and estimated glomerular filtration rate (eGFR).

    The SHAP method was used to interpret the contribution of each feature to the model, and a user-friendly prediction platform was built on the optimal model to assist clinicians in making clinical decisions. The machine learning model-based prediction system developed in this study exhibited favorable predictive ability and promising clinical utility. The web-based prediction platform, combined with the professional judgment of clinicians, can help produce better clinical decisions.
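    As a minimal sketch of the 10-fold cross-validated AUC step only (not the authors' full pipeline of imputation, balancing, and feature screening), the snippet below scores a plain logistic regression on a synthetic stand-in cohort, since the hospital data is not public.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in cohort; logistic regression substitutes for the
# eighteen algorithms compared in the study.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 10-fold cross-validated AUC, mirroring the internal-validation setup.
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=10, scoring="roc_auc").mean()
```

    The same `cross_val_score` call generalizes to any of the other estimators by swapping the model argument; external validation would instead score a held-out test set with bootstrap resampling.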

  16. Real Estate Price Prediction Data

    • figshare.com
    txt
    Updated Aug 8, 2024
    Cite
    Mohammad Shbool; Rand Al-Dmour; Bashar Al-Shboul; Nibal Albashabsheh; Najat Almasarwah (2024). Real Estate Price Prediction Data [Dataset]. http://doi.org/10.6084/m9.figshare.26517325.v1
    Explore at:
    txt
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mohammad Shbool; Rand Al-Dmour; Bashar Al-Shboul; Nibal Albashabsheh; Najat Almasarwah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This dataset was collected and curated to support research on predicting real estate prices using machine learning algorithms, specifically Support Vector Regression (SVR) and Gradient Boosting Machine (GBM). The dataset includes comprehensive information on residential properties, enabling the development and evaluation of predictive models for accurate and transparent real estate appraisals.

    Data Source: The data was sourced from Department of Lands and Survey real estate listings.

    Features: The dataset contains the following key attributes for each property:

    Area (in square meters): The total living area of the property.
    Floor Number: The floor on which the property is located.
    Location: Geographic coordinates or city/region where the property is situated.
    Type of Apartment: The classification of the property, such as studio, one-bedroom, two-bedroom, etc.
    Number of Bathrooms: The total number of bathrooms in the property.
    Number of Bedrooms: The total number of bedrooms in the property.
    Property Age (in years): The number of years since the property was constructed.
    Property Condition: A categorical variable indicating the condition of the property (e.g., new, good, fair, needs renovation).
    Proximity to Amenities: The distance to nearby amenities such as schools, hospitals, shopping centers, and public transportation.
    Market Price (target variable): The actual sale price or listed price of the property.

    Data Preprocessing:

    Normalization: Numeric features such as area and proximity to amenities were normalized to ensure consistency and improve model performance.
    Categorical Encoding: Categorical features like property condition and type of apartment were encoded using one-hot encoding or label encoding, depending on the specific model requirements.
    Missing Values: Missing data points were handled using appropriate imputation techniques or by excluding records with significant missing information.

    Usage: This dataset was utilized to train and test machine learning models, aiming to predict the market price of residential properties based on the provided attributes. The models developed using this dataset demonstrated improved accuracy and transparency over traditional appraisal methods.

    Dataset Availability: The dataset is available for public use under CC BY 4.0. Users are encouraged to cite the related publication when using the data in their research or applications.

    Citation: If you use this dataset in your research, please cite the following publication: "Real Estate Decision-Making: Precision in Price Prediction through Advanced Machine Learning Algorithms."
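    The preprocessing and modeling steps described above can be sketched as a single scikit-learn pipeline. The tiny frame below is an invented stand-in that reuses the listed column names, with a GBM standing in for the full SVR/GBM comparison.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tiny illustrative frame; column names follow the description, values invented.
X = pd.DataFrame({
    "Area": [80, 120, 95, 150],
    "Property Age": [5, 20, 10, 1],
    "Property Condition": ["new", "fair", "good", "new"],
})
y = [90_000, 110_000, 95_000, 160_000]

# Normalize numeric features and one-hot encode categoricals, then fit a GBM.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", MinMaxScaler(), ["Area", "Property Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Property Condition"]),
    ])),
    ("gbm", GradientBoostingRegressor(random_state=0)),
])
model.fit(X, y)
preds = model.predict(X)
```

    Bundling the preprocessing into the pipeline ensures the scaler and encoder are fit on training data only whenever the pipeline is cross-validated or scored on a held-out set.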
