Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods to explore imputation efficiency and its effect on missing data, using Internet and public service data as the test examples. The empirical results show that the method not only verifies the robustness of the positive impact of Internet penetration on public services, but also confirms that the machine-learning imputation method outperforms random and multiple imputation, greatly improving the model’s explanatory power. After machine-learning imputation, the panel data show better continuity in the time trend and can also be analyzed with a dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, the mechanisms underlying the empirical results are discussed.
The purpose of the project was to learn more about patterns of homicide in the United States by strengthening the ability to make imputations for Supplementary Homicide Report (SHR) data with missing values. Supplementary Homicide Reports (SHR) and local police data from Chicago, Illinois, St. Louis, Missouri, Philadelphia, Pennsylvania, and Phoenix, Arizona, for 1990 to 1995 were merged to create a master file by linking on overlapping information on victim and incident characteristics. Through this process, 96 percent of the cases in the SHR were matched with cases in the police files. The data contain variables for three types of cases: complete in the SHR, missing offender and incident information in the SHR but known in the police report, and missing offender and incident information in both. The merged file allows estimation of similarities and differences between the cases with known offender characteristics in the SHR and those in the other two categories. The accuracy of existing data imputation methods can be assessed by comparing imputed values in an "incomplete" dataset (the SHR), generated by the three imputation strategies discussed in the literature, with the actual values in a known "complete" dataset (combined SHR and police data). Variables from both the Supplementary Homicide Reports and the additional police report offense data include incident date, victim characteristics, offender characteristics, incident details, geographic information, as well as variables regarding the matching procedure.
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).
The dataset includes:
- Category (Categorical): Product category (A, B, C, D)
- Price (Numerical): Randomized product prices
- Rating (Numerical): Ratings between 1 and 5
- Stock (Categorical): Availability status (In Stock, Out of Stock)
- Discount (Numerical): Discount percentage
This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
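As a concrete starting point, here is a minimal imputation sketch using pandas and scikit-learn; the file name synthetic_dataset.csv is a hypothetical stand-in, and the column names follow the list above:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("synthetic_dataset.csv")  # hypothetical file name

num_cols = ["Price", "Rating", "Discount"]
cat_cols = ["Category", "Stock"]

# Median for the numerical columns, most frequent value for the categorical ones.
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

print(df.isna().mean())  # every column should now report 0.0 missing
```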
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Descriptive statistics for original data and complete data after imputation in Experiment 2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the latest smartphones as of July 2024, gathered through web scraping using Selenium and Beautiful Soup. The dataset is available in four different versions, reflecting the stages of data cleaning and processing.
- If you want to know about the web scraping process, read the accompanying Medium article.
- If you want to see the step-by-step process of data cleaning and EDA, check out my GitHub repo.
This version contains the fully uncleaned data as it was initially scraped from the web. It includes all the raw information, with inconsistencies, missing values, and potential duplicates. Purpose: Serves as the baseline dataset for understanding the initial state of the data before any cleaning or processing.
Basic cleaning operations have been applied. This includes removing duplicates, handling missing values, and standardizing the formats of certain fields (e.g., dates, numerical values). Purpose: Provides a cleaner and more consistent dataset, making it easier for basic analysis.
Additional data cleaning techniques have been implemented. This version addresses more complex issues such as outlier detection and correction, normalization of categorical data, and initial feature engineering. Purpose: Offers a more refined dataset suitable for exploratory data analysis (EDA) and more in-depth statistical analyses.
This version represents the final, fully cleaned dataset. Advanced cleaning techniques have been applied, including imputation of missing data, removal of irrelevant features, and final feature engineering. Purpose: Ideal for machine learning model training and other advanced analytics.
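For illustration only (not the repository's actual code), a cleaning pass of the kind described above might look as follows; the file name and the "price" and "launch_date" columns are assumptions:

```python
import pandas as pd

raw = pd.read_csv("smartphones_raw.csv")  # hypothetical file name

# Basic cleaning: drop duplicate rows and standardize formats.
clean = raw.drop_duplicates().copy()
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")               # hypothetical column
clean["launch_date"] = pd.to_datetime(clean["launch_date"], errors="coerce")  # hypothetical column

# Final-version-style imputation of remaining numeric gaps.
clean["price"] = clean["price"].fillna(clean["price"].median())
```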
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Design-based causal inference, also known as randomization-based or finite-population causal inference, is one of the most widely used causal inference frameworks, largely because its validity can be guaranteed by study design (e.g., randomized experiments) and does not require assuming specific outcome-generating distributions or super-population models. Despite its advantages, design-based causal inference can still suffer from other issues, among which outcome missingness is a prevalent and significant challenge. This work systematically studies the outcome missingness problem in design-based causal inference. First, we propose a general and flexible outcome missingness mechanism that can facilitate finite-population-exact randomization tests of no treatment effect. Second, under this general missingness mechanism, we propose a general framework called “imputation and re-imputation” for conducting randomization tests in design-based causal inference with missing outcomes. We prove that our framework can still ensure finite-population-exact Type-I error rate control even when the imputation model is misspecified or when unobserved covariates or interference exist in the missingness mechanism. Third, we extend our framework to conduct covariate adjustment in randomization tests and construct finite-population-valid confidence regions with missing outcomes. Our framework is evaluated via extensive simulation studies and applied to a large-scale randomized experiment. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The early identification of students facing learning difficulties is one of the most critical challenges in modern education. Intervening effectively requires leveraging data to understand the complex interplay between student demographics, engagement patterns, and academic performance.
This dataset was created to serve as a high-quality, pre-processed resource for building machine learning models to tackle this very problem. It is a unique hybrid dataset, meticulously crafted by unifying three distinct sources:
The Open University Learning Analytics Dataset (OULAD): A rich dataset detailing student interactions with a Virtual Learning Environment (VLE). We have aggregated the raw, granular data (over 10 million interaction logs) into powerful features, such as total clicks, average assessment scores, and distinct days of activity for each student registration.
The UCI Student Performance Dataset: A classic educational dataset containing demographic information and final grades in Portuguese and Math subjects from two Portuguese schools.
A Synthetic Data Component: A synthetically generated portion of the data, created to balance the dataset or represent specific student profiles.
A direct merge of these sources was not possible as the student identifiers were not shared. Instead, a strategy of intelligent concatenation was employed. The final dataset has undergone a rigorous pre-processing pipeline to make it immediately usable for machine learning tasks:
Advanced Imputation: Missing values were handled using a sophisticated iterative imputation method powered by Gaussian Mixture Models (GMM), ensuring the dataset's integrity.
One-Hot Encoding: All categorical features have been converted to a numerical format.
Feature Scaling: All numerical features have been standardized (using StandardScaler) to have a mean of 0 and a standard deviation of 1, preventing model bias from features with different scales.
The result is a clean, comprehensive dataset ready for modeling.
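For readers who want to reproduce a comparable pipeline on their own raw data, the sketch below approximates the described steps with scikit-learn. Note that scikit-learn's IterativeImputer uses a regression estimator (BayesianRidge by default) rather than a Gaussian Mixture Model, so this is an approximation, and the input file name is an assumption:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("student_profiles_raw.csv")  # hypothetical pre-processing input
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Iterative (model-based) imputation of the numeric features.
df[num_cols] = IterativeImputer(random_state=0).fit_transform(df[num_cols])
# One-hot encode the categoricals, then standardize the numeric features.
df = pd.get_dummies(df, columns=list(cat_cols))
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```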
Each row represents a student profile, and the columns are the features and the target.
Features include aggregated online engagement metrics (e.g., clicks, distinct activities), academic performance (grades, scores), and student demographics (e.g., gender, age band). A key feature indicates the original data source (OULAD, UCI, Synthetic).
The dataset contains no Personally Identifiable Information (PII). Demographic information is presented in broad, anonymized categories.
Key Columns:
Target Variable:
had_difficulty: The primary target for classification. This binary variable has been engineered from the original final_result column of the OULAD dataset.
1: The student either failed (Fail) or withdrew (Withdrawn) from the course.
0: The student passed (Pass or Distinction).
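For reference, this is roughly how such a label can be derived from OULAD's final_result column (the published dataset already ships the engineered target; studentInfo.csv is the raw OULAD file that carries final_result):

```python
import pandas as pd

oulad = pd.read_csv("studentInfo.csv")  # raw OULAD registration/info file
# Fail or Withdrawn -> 1, Pass or Distinction -> 0
oulad["had_difficulty"] = oulad["final_result"].isin(["Fail", "Withdrawn"]).astype(int)
```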
Feature Groups:
OULAD Aggregated Features (e.g., oulad_total_cliques, oulad_media_notas): Quantitative metrics summarizing a student's engagement and performance within the VLE.
Academic Performance Features (e.g., nota_matematica_harmonizada): Harmonized grades from different data sources.
Demographic Features (e.g., gender_*, age_band_*): One-hot encoded columns representing student demographics.
Origin Features (e.g., origem_dado_OULAD, origem_dado_UCI): One-hot encoded columns indicating the original source of the data for each row. This allows for source-specific analysis.
(Note: All numerical feature names are post-scaling and may not directly reflect their original names. Please refer to the complete column list for details.)
This dataset would not be possible without the original data providers. Please acknowledge them in any work that uses this data:
OULAD Dataset: Kuzilek, J., Hlosta, M., and Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4. https://analyse.kmi.open.ac.uk/open_dataset
UCI Student Performance Dataset: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS. https://archive.ics.uci.edu/ml/datasets/student+performance
This dataset is perfect for a variety of predictive modeling tasks. Here are a few ideas to get you started:
Can you build a classification model to predict had_difficulty with high recall? (Minimizing the number of at-risk students we fail to identify).
Which features are the most powerful predictors of student failure or withdrawal? (Feature Importance Analysis).
Can you build separate models for each data origin (origem_dado_*) and compare ...
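As a baseline for the first two ideas, here is a hedged sketch (the file name is an assumption; features are assumed to be fully numeric after the preprocessing described above):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("student_difficulty.csv")  # hypothetical file name
X, y = df.drop(columns="had_difficulty"), df["had_difficulty"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# class_weight="balanced" nudges the model toward higher recall on the at-risk class.
clf = RandomForestClassifier(class_weight="balanced", random_state=42).fit(X_tr, y_tr)
print("recall:", recall_score(y_te, clf.predict(X_te)))

# Simple feature-importance ranking for the second question.
print(pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False).head(10))
```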
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Factors associated with each domain of burnout among resident physicians (multiple imputation, n = 296).
📝 Dataset Description
This dataset contains information about customers of a telecommunications company, including their demographic details, account information, service subscriptions, and churn status. It is a modified version of the popular Telco Churn dataset, curated for exploratory data analysis, machine learning model development, and churn prediction tasks.
The dataset includes simulated missing values in some columns to reflect real-world data issues and support preprocessing and imputation tasks. This makes it especially useful for demonstrating data cleaning techniques and evaluating model robustness.
📂 Files Included
telco_data_modified.csv: The main dataset with 21 columns and 7043 rows (some missing values are intentionally inserted).
📌 Features
- customerID: Unique identifier for each customer
- gender: Customer gender (Male/Female)
- SeniorCitizen: Indicates if the customer is a senior citizen (0 = No, 1 = Yes)
- Partner: Whether the customer has a partner
- Dependents: Whether the customer has dependents
- tenure: Number of months the customer has stayed with the company
- PhoneService: Whether the customer has phone service
- MultipleLines: Whether the customer has multiple lines
- InternetService: Customer's internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security
- OnlineBackup: Whether the customer has online backup
- DeviceProtection: Whether the customer has device protection
- TechSupport: Whether the customer has tech support
- StreamingTV: Whether the customer has streaming TV
- StreamingMovies: Whether the customer has streaming movies
- Contract: Type of contract (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer uses paperless billing
- PaymentMethod: Payment method (e.g., Electronic check, Mailed check, etc.)
- MonthlyCharges: Monthly charges
- TotalCharges: Total charges to date
- Churn: Whether the customer has left the company (Yes/No)
🔍 Use Cases
- Binary classification: Predict customer churn
- Data preprocessing and imputation exercises
- Feature engineering and importance analysis
- Customer segmentation and churn modeling
⚠️ Notes
- Missing values were intentionally inserted in the dataset to help simulate real-world conditions.
- Some preprocessing may be required before modeling (e.g., converting categorical to numerical data, handling TotalCharges as numeric); see the sketch below.
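A minimal preprocessing sketch for the points above, using the file name listed under Files Included:

```python
import pandas as pd

telco = pd.read_csv("telco_data_modified.csv")

# Blank strings in TotalCharges become NaN, then get a simple median fill.
telco["TotalCharges"] = pd.to_numeric(telco["TotalCharges"], errors="coerce")
telco["TotalCharges"] = telco["TotalCharges"].fillna(telco["TotalCharges"].median())

# Binary target for churn prediction.
telco["Churn"] = (telco["Churn"] == "Yes").astype(int)
```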
🙏 Acknowledgements
This dataset is based on the original Telco Customer Churn dataset (initially provided by IBM). The current version has been modified for academic and practical exercises.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Hymenopteran stings (from bees, wasps, and hornets) can trigger severe systemic reactions, especially in tropical regions, risking patient safety and emergency care efficiency. Accurate early risk stratification is essential to guide timely intervention.
Objective: To develop and validate an interpretable machine learning model for early prediction of severe outcomes following hymenopteran stings.
Methods: We retrospectively analyzed 942 cases from a multicenter cohort in Hainan Province, China. Questionnaires with >20% missing data were excluded. Mean substitution was applied for primary missing data imputation, with multiple imputation by chained equations (MICE) used for sensitivity analysis. Seven supervised classifiers were trained using five-fold cross-validation; class imbalance was addressed using the adaptive synthetic sampling (ADASYN) algorithm. Model performance was evaluated via area under the receiver operating characteristic curve (AUC), recall, and precision, and feature importance was interpreted using Shapley additive explanations (SHAP) values.
Results: Among 942 patients, 8.7% developed severe systemic complications. The distribution by species was: wasps (25.5%), honey bees (8.9%), and unknown species (65.6%). The optimal Extra Trees model achieved an AUC of 0.982, recall of 0.956, and precision of 0.926 in the held-out validation set. Key predictors included hypotension, dyspnea, altered mental status, elevated leukocyte counts, and abnormal creatinine levels. A web-based risk calculator was deployed for bedside application. Given the small number of high-risk cases, these high AUC values may overestimate real-world performance and require external validation.
Conclusion: We developed an interpretable, deployable tool for early triage of hymenopteran sting patients in tropical settings. Integrating it into emergency workflows may improve clinical decisions and outcomes.
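Not the authors' code, but a minimal sketch of the kind of pipeline the abstract describes (ADASYN resampling plus an Extra Trees classifier evaluated by AUC), assuming a fully numeric feature table with a hypothetical severe_outcome target column:

```python
import pandas as pd
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("sting_cohort.csv")  # hypothetical, fully numeric feature table
X, y = df.drop(columns="severe_outcome"), df["severe_outcome"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

# Oversample the rare severe cases, then fit an Extra Trees classifier.
X_res, y_res = ADASYN(random_state=0).fit_resample(X_tr, y_tr)
model = ExtraTreesClassifier(random_state=0).fit(X_res, y_res)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```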
This dataset was collected for a research paper titled "Twitter-informed Prediction for Urban Traffic Flow Using Machine Learning," which is available online at https://ieeexplore.ieee.org/document/10185516. If you intend to use this dataset, we kindly request that you consider acknowledging our paper by including a citation. Your support in referencing our work would be greatly appreciated.
The traffic dataset was obtained through the California Performance Measurement System (PeMS) in the United States. It encompasses traffic data, including speed and flow information, for the eastbound lanes of the Ventura Highway in Los Angeles, covering the period from February 1 to May 31, 2020.
Calendar features in this dataset consist of weekdays, represented as numbers from 1 to 7, and a binary variable indicating whether a specific day is a holiday. Weather data was sourced from the Wunderground website (accessible at https://www.wunderground.com/history/daily/KLAX) throughout the study period. Weather data includes hourly observations of various meteorological factors. For consistency, we assume that weather conditions remain constant during each 5-minute time interval within an hour.
Weather conditions in the dataset include categories such as fair, blowing dust, cloudy, cloudy/windy, fair/windy, fog, haze, heavy rain, light rain, mostly cloudy, mostly cloudy/windy, partly cloudy/windy, rain, and thunder in the vicinity. Temperature is measured in Fahrenheit.
Missing data in this context refers to temporary disruptions in the availability of traffic information within specific areas of the transportation network due to sensor failures or noisy data. To address these missing values, we employed the mean imputation method.
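In code, mean imputation of this kind is a one-liner per column; the file and column names below are assumptions:

```python
import pandas as pd

traffic = pd.read_csv("pems_ventura_eastbound.csv")  # hypothetical file name
for col in ["speed", "flow"]:                        # hypothetical column names
    # Replace missing 5-minute readings with the column mean.
    traffic[col] = traffic[col].fillna(traffic[col].mean())
```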
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
MedSec-25 is a comprehensive, labeled network traffic dataset designed specifically for the Internet of Medical Things (IoMT) in healthcare environments. It addresses the limitations of existing generic IoT datasets by capturing realistic traffic from a custom-built healthcare IoT lab that mimics real-world hospital operations. The dataset includes both benign (normal) traffic and malicious traffic from multi-staged attack campaigns inspired by the MITRE ATT&CK framework. This allows for the development and evaluation of machine learning-based intrusion detection systems (IDS) tailored to IoMT scenarios, where patient safety and data privacy are critical. The dataset was generated using a variety of medical sensors (e.g., ECG, EEG, HHI, Respiration, SpO2) and environmental sensors (e.g., thermistor, ultrasonic, PIR, flame) connected via Raspberry Pi nodes and an IoT server. Traffic was captured over 7.5 hours using tools like Wireshark and tcpdump, resulting in PCAPNG files. These were processed with CICFlowMeter to extract flow-based features, producing a cleaned CSV dataset with 554,534 bidirectional network flows and 84 features.
Realistic Setup: Built in a physical lab at Rochester Institute of Technology, Dubai, incorporating diverse IoMT devices, protocols (e.g., MQTT, SSH, Telnet, FTP, HTTP, DNS), and real-time patient interactions (anonymized to comply with privacy regulations like HIPAA).
Multi-Staged Attacks: Unlike datasets focusing on isolated attacks, MedSec-25 simulates full attack chains: Reconnaissance (e.g., SYN/TCP scans, OS fingerprinting), Initial Access (e.g., brute-force, malformed MQTT packets), Lateral Movement (e.g., exploiting vulnerabilities to pivot between devices), and Exfiltration (e.g., data theft via MQTT).
Imbalanced Nature: This is the cleaned (imbalanced) version of the dataset. Users may need to apply balancing techniques (e.g., SMOTE oversampling + random undersampling) for model training, as demonstrated in the associated paper.
Size and Quality: 554,534 rows, no duplicates, no missing values (except 111 NaNs in Flow Byts/s, ~0.02%, which can be handled via imputation). Data types include float64 (45 columns), int64 (34 columns), and object (5 columns: Flow ID, Src IP, Dst IP, Timestamp, Label).
Utility: Preliminary models trained on this dataset (e.g., KNN: 98.09% accuracy, Decision Tree: 98.35% accuracy) show excellent performance for detecting attack stages.
This dataset is ideal for researchers in cybersecurity, machine learning, and healthcare IoT, enabling the creation of an IDS that can detect attacks at different phases to prevent escalation.
Benign Traffic: Generated over two days with active sensors, services (HTTP dashboard for patient monitoring, SSH/Telnet for remote access, FTP for file transfers), and real users (students/faculty) interacting with medical devices. No personally identifiable information was stored.
Malicious Traffic: Two Kali Linux attacker machines simulated MITRE ATT&CK-inspired campaigns using tools like Nmap, Scapy, Metasploit, and custom Python scripts.
Capture Tools: Wireshark and tcpdump for PCAPNG files (total ~1GB: 600MB benign, 400MB malicious).
Processing: Combined PCAP files per label, extracted features with CICFlowMeter, labeled flows manually based on attack phases, and cleaned for ML readiness. The final cleaned CSV is ~350MB.
The dataset includes 84 features extracted by CICFlowMeter, categorized as:
Identifiers: Flow ID, Src IP, Src Port, Dst IP, Dst Port, Protocol, Timestamp.
Time-Series Metrics: Flow Duration, Flow IAT Mean/Std/Max/Min, Fwd/Bwd IAT Tot/Mean/Std/Max/Min.
Size/Count Statistics: Tot Fwd/Bwd Pkts, TotLen Fwd/Bwd Pkts, Fwd/Bwd Pkt Len Max/Min/Mean/Std, Pkt Len Min/Max/Mean/Std/Var, Pkt Size Avg.
Flag Counts: Fwd/Bwd PSH/URG Flags, FIN/SYN/RST/PSH/ACK/URG/CWE/ECE Flag Cnt.
Rates and Ratios: Flow Byts/s, Flow Pkts/s, Fwd/Bwd Pkts/s, Down/Up Ratio, Active/Idle Mean/Std/Max/Min.
Segmentation and Others: Fwd/Bwd Seg Size Avg/Min, Subflow Fwd/Bwd Pkts/Byts, Init Fwd/Bwd Win Byts, Fwd Act Data Pkts, Fwd/Bwd Byts/b Avg, Fwd/Bwd Pkts/b Avg, Fwd/Bwd Blk Rate Avg.
The dataset is labeled with 5 classes representing benign behavior and attack stages:
Reconnaissance: 401,683 flows
Initial Access: 102,090 flows
Exfiltration: 25,915 flows
Lateral Movement: 12,498 flows
Benign: 12,348 flows
Note: The dataset is imbalanced, with Reconnaissance dominating. Apply balancing techniques for optimal ML performance.
Preprocessing Suggestions: Encode categorical features (e.g., Protocol, Label) using LabelEncoder. Normalize numerical features with Min-Max Scaler or StandardScaler. Handle the minor NaNs in Flow Byts/s via mean imputation.
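The suggestions above, sketched with pandas and scikit-learn (the CSV file name is an assumption; column names follow the feature list):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

flows = pd.read_csv("medsec25_cleaned.csv")  # hypothetical file name

# Mean-impute the ~111 NaNs in Flow Byts/s and encode the class label.
flows["Flow Byts/s"] = flows["Flow Byts/s"].fillna(flows["Flow Byts/s"].mean())
flows["Label"] = LabelEncoder().fit_transform(flows["Label"])

# Standardize the numeric features (leave the encoded label untouched).
num_cols = flows.select_dtypes(include="number").columns.drop("Label")
flows[num_cols] = StandardScaler().fit_transform(flows[num_cols])
```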
Model Training: Split into train/test (e.g., 80/20). Suitable for classification tasks w...
http://opendatacommons.org/licenses/dbcl/1.0/
Context and Sources: This dataset comprises 10,000 hotel reviews collected from various online sources, including platforms like Hotels.com and TripAdvisor. Each entry contains detailed information about the review, the reviewer, and the hotel, providing valuable insights into customer satisfaction and preferences.
Inspiration: This dataset was created to facilitate the analysis of customer reviews in the hospitality industry. It can be used to study customer sentiments, identify trends, and improve hotel services by understanding the key factors that contribute to customer satisfaction.
Use Cases:
- Sentiment Analysis: Analyze the sentiment of reviews to determine customer satisfaction and identify areas for improvement.
- Trend Analysis: Identify common themes and trends in customer feedback over time.
- Recommender Systems: Use the data to build systems that suggest hotels based on user preferences and review patterns.
- Market Research: Understand customer preferences and competitive positioning within the hotel industry.
Dataset Overview:
- Number of Rows: 10,000
- Number of Columns: 25
Key Columns:
- reviews.text: The text of the review, offering qualitative insights into customer experiences.
- reviews.rating: The rating given by the reviewer, typically on a scale from 1 to 5.
- city, country: Geographical location of the hotel, enabling region-specific analysis.
- reviews.username: The username of the reviewer, which can be used to study review patterns and behaviors.
- reviews.date: The date the review was written, useful for temporal analysis.
Potential Challenges:
- Missing Data: Some columns like reviews.userCity and reviews.userProvince have missing values, which may require imputation or exclusion during analysis.
- Data Imbalance: The distribution of ratings might be skewed, which could affect sentiment analysis or other predictive modeling tasks.
This dataset is well-suited for various applications in the fields of natural language processing, machine learning, and data analysis within the hospitality industry.
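A quick way to quantify the two challenges noted above (the file name is an assumption; the column names follow the list of key columns):

```python
import pandas as pd

reviews = pd.read_csv("hotel_reviews.csv")  # hypothetical file name

# Share of missing values in the sparsely populated reviewer-location columns.
print(reviews[["reviews.userCity", "reviews.userProvince"]].isna().mean())

# Rating distribution, to gauge how skewed the classes are.
print(reviews["reviews.rating"].value_counts(normalize=True).sort_index())
```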
Context
This dataset contains detailed, anonymized information about a bank's customers. It includes demographic data such as age, income, and family size, as well as financial information like mortgage value, credit card ownership, and average spending habits. The data is well-suited for a variety of machine learning tasks, particularly in the domain of financial services and marketing.
Content
The dataset consists of 5000 customer records with 14 attributes:
Data Quality Note
Some rows contain negative values for the Years_Experience column. This is a data quality issue that may require preprocessing (e.g., imputation by taking the absolute value or using the average of similar age groups).
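Both suggested fixes, sketched below (the file name and the Age column are assumptions):

```python
import pandas as pd

bank = pd.read_csv("bank_customers.csv")  # hypothetical file name

# Option 1: treat negatives as sign errors and take the absolute value.
bank["Years_Experience"] = bank["Years_Experience"].abs()

# Option 2 (instead of option 1): replace negatives with the mean of the same age group.
# neg = bank["Years_Experience"] < 0  # "Age" is an assumed column name
# bank.loc[neg, "Years_Experience"] = bank.groupby("Age")["Years_Experience"].transform("mean")[neg]
```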
Potential Use Cases
This dataset is excellent for both educational and practical purposes. You can use it to:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Patients with type 2 diabetes mellitus (T2DM) are at higher risk for urinary tract infections (UTIs), which greatly impact their quality of life. Developing a risk prediction model to identify patients with T2DM at high risk of UTIs, and using it to assist clinical decision-making, can help reduce the incidence of UTIs in this population. To construct the predictive model, potential relevant variables were first selected from the reference literature, and data were then extracted from the Hospital Information System (HIS) of the Sichuan Academy of Medical Sciences and Sichuan Provincial People’s Hospital for analysis. The dataset was split into a training set and a test set in an 8:2 ratio. To process the data and establish risk warning models, four imputation methods, four balancing methods, three feature screening methods, and eighteen machine learning algorithms were employed. A 10-fold cross-validation technique was applied to internally validate the training set, while the bootstrap method was used for external validation on the test set. The area under the receiver operating characteristic curve (AUC) and decision curve analysis (DCA) were used to evaluate the performance of the models. The contributions of features were interpreted using the SHapley Additive exPlanations (SHAP) approach, and a web-based prediction platform for UTIs in T2DM was constructed with the Flask framework. Finally, 106 variables were identified for analysis from a total of 119 literature sources, and 1340 patients were included in the study. After comprehensive data preprocessing, a total of 48 datasets were generated, and 864 risk warning models were constructed based on various balancing methods, feature selection techniques, and a range of machine learning algorithms. Receiver operating characteristic (ROC) curves were used to assess the performance of these models, and the best model achieved an impressive AUC of 0.9789 upon external validation. Notably, the most critical factors contributing to UTIs in T2DM patients were UTI-related inflammatory markers, medication use (mainly SGLT2 inhibitors), severity of comorbidities, blood routine indicators, and other factors such as length of hospital stay and estimated glomerular filtration rate (eGFR). The SHAP method was utilized to interpret the contribution of each feature to the model, and based on the optimal predictive model a user-friendly prediction platform for UTIs in T2DM was built to assist clinicians in making clinical decisions. The machine learning model-based prediction system developed in this study exhibited favorable predictive ability and promising clinical utility. The web-based prediction platform, combined with the professional judgment of clinicians, can support better clinical decisions.
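Not the study's code: the sketch below shows one imputation / feature-screening / classifier combination of the kind the authors compare, evaluated with 10-fold cross-validated AUC (the file name and the "uti" target column are assumptions):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

df = pd.read_csv("t2dm_uti.csv")            # hypothetical, numeric feature table
X, y = df.drop(columns="uti"), df["uti"]    # target column name is an assumption

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # one of several imputation choices
    ("select", SelectKBest(f_classif, k=20)),       # one of several feature-screening choices
    ("model", LogisticRegression(max_iter=1000)),   # one of many candidate algorithms
])
print("10-fold AUC:", cross_val_score(pipe, X, y, cv=10, scoring="roc_auc").mean())
```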
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This dataset was collected and curated to support research on predicting real estate prices using machine learning algorithms, specifically Support Vector Regression (SVR) and Gradient Boosting Machine (GBM). The dataset includes comprehensive information on residential properties, enabling the development and evaluation of predictive models for accurate and transparent real estate appraisals.
Data Source: The data was sourced from Department of Lands and Survey real estate listings.
Features: The dataset contains the following key attributes for each property:
- Area (in square meters): The total living area of the property.
- Floor Number: The floor on which the property is located.
- Location: Geographic coordinates or city/region where the property is situated.
- Type of Apartment: The classification of the property, such as studio, one-bedroom, two-bedroom, etc.
- Number of Bathrooms: The total number of bathrooms in the property.
- Number of Bedrooms: The total number of bedrooms in the property.
- Property Age (in years): The number of years since the property was constructed.
- Property Condition: A categorical variable indicating the condition of the property (e.g., new, good, fair, needs renovation).
- Proximity to Amenities: The distance to nearby amenities such as schools, hospitals, shopping centers, and public transportation.
- Market Price (target variable): The actual sale price or listed price of the property.
Data Preprocessing:
- Normalization: Numeric features such as area and proximity to amenities were normalized to ensure consistency and improve model performance.
- Categorical Encoding: Categorical features like property condition and type of apartment were encoded using one-hot encoding or label encoding, depending on the specific model requirements.
- Missing Values: Missing data points were handled using appropriate imputation techniques or by excluding records with significant missing information.
Usage: This dataset was utilized to train and test machine learning models, aiming to predict the market price of residential properties based on the provided attributes. The models developed using this dataset demonstrated improved accuracy and transparency over traditional appraisal methods.
Dataset Availability: The dataset is available for public use under the CC BY 4.0 license. Users are encouraged to cite the related publication when using the data in their research or applications.
Citation: If you use this dataset in your research, please cite the following publication: [Real Estate Decision-Making: Precision in Price Prediction through Advanced Machine Learning Algorithms].
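For illustration (not the authors' implementation), a preprocessing-plus-model sketch along the lines described; the file name and column names are assumptions based on the feature list:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

df = pd.read_csv("real_estate.csv")  # hypothetical file name
X, y = df.drop(columns="market_price"), df["market_price"]

numeric = ["area", "proximity_to_amenities", "property_age", "bedrooms", "bathrooms"]
categorical = ["type_of_apartment", "property_condition", "location"]
prep = ColumnTransformer([
    ("num", StandardScaler(), numeric),                            # normalization
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),  # categorical encoding
])

# Compare the two model families mentioned above with 5-fold cross-validation.
for name, reg in [("SVR", SVR()), ("GBM", GradientBoostingRegressor())]:
    model = Pipeline([("prep", prep), ("reg", reg)])
    print(name, cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```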