Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.
To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms included both univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR).
IDW outperformed the others, achieving very good performance (Nash-Sutcliffe efficiency, NSE, greater than 0.8) in most cases.
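To make the winning approach concrete, below is a minimal Python sketch of inverse-distance-weighted imputation across stations, together with the NSE score used to judge it. It assumes station coordinates are available and that one variable is imputed per time step; it is an illustration, not the implementation used in the paper.

```python
import numpy as np

def idw_impute(values, coords, p=2):
    """Fill gaps at stations with inverse-distance-weighted means of the
    stations that observed the variable at the same time step."""
    values = np.asarray(values, dtype=float)   # one variable, one time step, per station
    coords = np.asarray(coords, dtype=float)   # station coordinates, shape (n_stations, 2)
    filled = values.copy()
    missing = np.isnan(values)
    observed = ~missing
    for i in np.where(missing)[0]:
        d = np.linalg.norm(coords[observed] - coords[i], axis=1)
        w = 1.0 / np.maximum(d, 1e-9) ** p     # inverse-distance weights
        filled[i] = np.sum(w * values[observed]) / np.sum(w)
    return filled

def nse(obs, sim):
    """Nash-Sutcliffe efficiency; the study treats NSE > 0.8 as very good."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)
```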
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is an inevitable aspect of empirical research. Researchers have developed several techniques to handle missing data in order to avoid information loss and bias. Over the past 50 years, these methods have become more efficient but also more complex. Building on previous review studies, this paper analyzes which missing data handling methods are used across various scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016; JSTOR provided the data in text format. We used a text-mining approach to extract the necessary information from this corpus. Our results show that the usage of advanced missing data handling methods, such as Multiple Imputation or Full Information Maximum Likelihood estimation, grew steadily over the examination period. At the same time, simpler methods, such as listwise and pairwise deletion, remain in widespread use.
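To illustrate the kind of text-mining step described above, here is a small Python sketch that counts how many articles mention a few missing-data methods. The regular-expression patterns and the method list are simplified assumptions, not the study's actual dictionary or pipeline.

```python
import re
from collections import Counter

# Simplified search patterns for the method families discussed above.
METHOD_PATTERNS = {
    "multiple imputation": r"multiple imputation",
    "FIML": r"full information maximum likelihood|FIML",
    "listwise deletion": r"listwise deletion|complete[- ]case analysis",
    "pairwise deletion": r"pairwise deletion",
}

def count_method_mentions(articles):
    """Count how many articles mention each missing-data method at least once."""
    counts = Counter()
    for text in articles:
        for method, pattern in METHOD_PATTERNS.items():
            if re.search(pattern, text, flags=re.IGNORECASE):
                counts[method] += 1
    return counts

# counts = count_method_mentions(corpus_texts)   # corpus_texts: list of article strings
```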
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cervical cancer is a leading cause of women's mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification: handling missing data in datasets. The study presents a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by the use of a KNN Imputer to manage missing values. The proposed model achieves remarkable results with an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. The study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women's health and healthcare systems.
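For readers who want a concrete starting point, below is a minimal scikit-learn sketch of the KNN-imputation-plus-voting-ensemble idea. The three base learners and all hyperparameters are placeholders for illustration, not necessarily the models used in the study.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# KNN imputation of missing risk-factor values, followed by a soft-voting ensemble.
model = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("ensemble", VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
            ("lr", LogisticRegression(max_iter=1000)),
            ("svc", SVC(probability=True, random_state=0)),
        ],
        voting="soft",
    )),
])
# model.fit(X_train, y_train); y_pred = model.predict(X_test)
```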
https://spdx.org/licenses/CC0-1.0.html
We propose a framework for meta-analysis of qualitative causal inferences. We integrate qualitative counterfactual inquiry with an approach from the quantitative causal inference literature called extreme value bounds. Qualitative counterfactual analysis uses the observed outcome and auxiliary information to infer what would have happened had the treatment been set to a different level. Imputing missing potential outcomes is hard, and when it fails, we can fill them in under best- and worst-case scenarios. We apply our approach to 63 cases that could have experienced transitional truth commissions upon democratization, 8 of which did. Prior to any analysis, the extreme value bounds around the average treatment effect on authoritarian resumption are 100 percentage points wide; imputation shrinks the width of these bounds to 51 points. We further demonstrate our method by aggregating specialists' beliefs about causal effects gathered through an expert survey, shrinking the width of the bounds to 44 points.
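A minimal sketch of the no-assumptions extreme value bounds computed before any qualitative imputation, assuming a binary outcome coded in [0, 1]; variable and function names are illustrative, not taken from the authors' replication code.

```python
import numpy as np

def extreme_value_bounds(y, d, lo=0.0, hi=1.0):
    """Best-/worst-case bounds on the average treatment effect when every
    unobserved potential outcome is filled in with the extremes lo and hi."""
    y = np.asarray(y, dtype=float)   # observed outcome (e.g., authoritarian resumption)
    d = np.asarray(d, dtype=int)     # treatment indicator (e.g., truth commission held)
    p = d.mean()                     # share of treated cases (8/63 in the application)
    y1_obs = y[d == 1].mean()        # E[Y | D = 1], observed
    y0_obs = y[d == 0].mean()        # E[Y | D = 0], observed
    lower = (p * y1_obs + (1 - p) * lo) - ((1 - p) * y0_obs + p * hi)
    upper = (p * y1_obs + (1 - p) * hi) - ((1 - p) * y0_obs + p * lo)
    return lower, upper

# The width of these bounds is always hi - lo (100 percentage points for a binary
# outcome); case-specific best- and worst-case imputations are what narrow them.
```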
Occupation data for 2021 and 2022 data files
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).
Latest edition information
For the third edition (September 2023), the variables NSECM20, NSECMJ20, SC2010M, SC20SMJ, SC20SMN and SOC20M have been replaced with new versions. Further information on the SOC revisions can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).
https://webtechsurvey.com/terms
A complete list of live websites vulnerable to CWE-230, compiled through global website indexing conducted by WebTechSurvey.
The integration of proteomic datasets generated by non-cooperating laboratories using different LC-MS/MS setups can overcome limitations of statistically underpowered sample cohorts but has not been demonstrated to this day. In proteomics, differences in sample preservation and preparation strategies, chromatography and mass spectrometry approaches, and the quantification strategy used distort protein abundance distributions in integrated datasets. The removal of these technical batch effects requires setup-specific normalization and strategies that can handle both missing at random (MAR) and missing not at random (MNAR) values. Algorithms for batch effect removal, such as the ComBat algorithm commonly used for other omics types, disregard proteins with MNAR missing values and significantly reduce the informational yield and the effect size of combined datasets. Here, we present a strategy for data harmonization across different tissue preservation techniques, LC-MS/MS instrumentation setups, and quantification approaches. To enable batch effect removal without the need for data reduction or error-prone imputation, we developed an extension to the ComBat algorithm, ComBat HarmonizR, that performs data harmonization with appropriate handling of MAR and MNAR missing values by matrix dissection. The ComBat HarmonizR-based strategy enables the combined analysis of independently generated proteomic datasets for the first time. Furthermore, we found ComBat HarmonizR to be superior for removing batch effects between different Tandem Mass Tag (TMT) plexes, compared to the commonly used internal reference scaling (iRS). Because the matrix dissection approach requires no data imputation, the HarmonizR algorithm can be applied to any type of omics data while assuring minimal data loss.
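The HarmonizR extension itself is distributed as an R package; the snippet below is only a conceptual Python sketch of the matrix-dissection idea, with simple per-batch median-centering standing in for the ComBat adjustment. It is not the authors' implementation.

```python
import numpy as np

def dissect_and_center(X, batches):
    """Conceptual sketch: correct each protein (row) only in the batches where it
    was actually quantified, so MNAR-missing entries are neither imputed nor force
    the protein to be dropped. Median-centering stands in for ComBat here."""
    X = np.array(X, dtype=float)        # proteins x samples, NaN = not quantified
    batches = np.asarray(batches)       # batch label for each sample (column)
    Xc = X.copy()
    for b in np.unique(batches):
        cols = batches == b
        block = Xc[:, cols]
        med = np.nanmedian(block, axis=1, keepdims=True)  # per-protein batch median
        med = np.where(np.isnan(med), 0.0, med)           # protein absent in this batch: leave unchanged
        Xc[:, cols] = block - med
    return Xc
```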
This dataset contains the data used in the tutorial article "Handling Planned and Unplanned Missing Data in a Longitudinal Study", in press at The Quantitative Methods for Psychology. It contains a subset of longitudinal data collected within the context of a survey about COVID-19 (data on sleep and emotions). This dataset is intended for tutorial purposes only. With the observations and variables in this dataset, the analyses presented in the tutorial can be reproduced. For more information, see de la Sablonnière et al. (2020).
Code for analysis of missing data
The Autoscout dataset, available on Kaggle, provides comprehensive information about vehicles listed for sale. This dataset includes a variety of attributes detailing each vehicle, which is essential for conducting detailed analyses of the automotive market.
In Part 2: Handling Missing Values, the dataset has undergone rigorous cleaning to address and resolve missing values across several columns. This step is crucial for ensuring the dataset's quality and reliability, leaving the data accurate, complete, and ready for analysis.
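A hedged pandas sketch of this kind of cleaning pass; the file name, the completeness threshold, and the median/mode imputation choices are assumptions for illustration, not the exact steps applied to the published dataset.

```python
import pandas as pd

df = pd.read_csv("autoscout.csv")                      # hypothetical file name

# Drop columns that are almost entirely empty, then fill the remaining gaps.
df = df.dropna(axis=1, thresh=int(0.2 * len(df)))      # keep columns that are >= 20% populated
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(include="object").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

print(df.isna().sum().sum())                           # remaining missing values: 0
```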
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Restaurant Sales Dataset with Dirt contains data for 17,534 transactions. The data introduces realistic inconsistencies ("dirt") to simulate real-world scenarios where data may have missing or incomplete information. The dataset includes sales details across multiple categories, such as starters, main dishes, desserts, drinks, and side dishes.
This dataset is suitable for: - Practicing data cleaning tasks, such as handling missing values and deducing missing information. - Conducting exploratory data analysis (EDA) to study restaurant sales patterns. - Feature engineering to create new variables for machine learning tasks.
Column Name | Description | Example Values |
---|---|---|
Order ID | A unique identifier for each order. | ORD_123456 |
Customer ID | A unique identifier for each customer. | CUST_001 |
Category | The category of the purchased item. | Main Dishes, Drinks |
Item | The name of the purchased item. May contain missing values due to data dirt. | Grilled Chicken, None |
Price | The static price of the item. May contain missing values. | 15.0, None |
Quantity | The quantity of the purchased item. May contain missing values. | 1, None |
Order Total | The total price for the order (Price * Quantity). May contain missing values. | 45.0, None |
Order Date | The date when the order was placed. Always present. | 2022-01-15 |
Payment Method | The payment method used for the transaction. May contain missing values due to data dirt. | Cash, None |
1. Data Dirtiness:
- Missing values in key fields (Item, Price, Quantity, Order Total, Payment Method) simulate real-world challenges.
- Many missing values can be deduced from the fields that remain: via the menu when the Item is present, via the menu when the Price is present, or arithmetically when the Quantity and Order Total are present. If Price or Quantity is missing, the other is used to deduce the missing value (e.g., Order Total / Quantity).

2. Menu Categories and Items:
- Starters: e.g., Chicken Melt, French Fries.
- Main Dishes: e.g., Grilled Chicken, Steak.
- Desserts: e.g., Chocolate Cake, Ice Cream.
- Drinks: e.g., Coca Cola, Water.
- Side Dishes: e.g., Mashed Potatoes, Garlic Bread.

3. Time Range:
- Orders span from January 1, 2022, to December 31, 2023.

Handle Missing Values:
- Deduce a missing Order Total or Quantity using the formula: Order Total = Price * Quantity.
- Deduce a missing Price from Order Total / Quantity if both are available.

Validate Data Consistency:
- Check that recorded and recomputed totals (Order Total = Price * Quantity) match (a pandas sketch of these steps follows the menu table below).

Analyze Missing Patterns:
Category | Item | Price |
---|---|---|
Starters | Chicken Melt | 8.0 |
Starters | French Fries | 4.0 |
Starters | Cheese Fries | 5.0 |
Starters | Sweet Potato Fries | 5.0 |
Starters | Beef Chili | 7.0 |
Starters | Nachos Grande | 10.0 |
Main Dishes | Grilled Chicken | 15.0 |
Main Dishes | Steak | 20.0 |
Main Dishes | Pasta Alfredo | 12.0 |
Main Dishes | Salmon | 18.0 |
Main Dishes | Vegetarian Platter | 14.0 |
Desserts | Chocolate Cake | 6.0 |
Desserts | Ice Cream | 5.0 |
Desserts | Fruit Salad | 4.0 |
Desserts | Cheesecake | 7.0 |
Desserts | Brownie | 6.0 |
Drinks | Coca Cola | 2.5 |
Drinks | Orange Juice | 3.0 |
Drinks ... |
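Following the tasks listed above, here is a minimal pandas sketch that fills the deducible gaps via Order Total = Price * Quantity and flags inconsistent rows; the file name is a placeholder.

```python
import pandas as pd

df = pd.read_csv("restaurant_sales_dirty.csv")   # placeholder file name

# Deduce each of Price, Quantity and Order Total when the other two are present.
m = df["Order Total"].isna() & df["Price"].notna() & df["Quantity"].notna()
df.loc[m, "Order Total"] = df.loc[m, "Price"] * df.loc[m, "Quantity"]

m = df["Price"].isna() & df["Order Total"].notna() & df["Quantity"].notna()
df.loc[m, "Price"] = df.loc[m, "Order Total"] / df.loc[m, "Quantity"]

m = df["Quantity"].isna() & df["Order Total"].notna() & df["Price"].notna()
df.loc[m, "Quantity"] = df.loc[m, "Order Total"] / df.loc[m, "Price"]

# Validate consistency where all three fields are now present.
complete = df[["Price", "Quantity", "Order Total"]].notna().all(axis=1)
df.loc[complete, "consistent"] = (
    df.loc[complete, "Price"] * df.loc[complete, "Quantity"]
    - df.loc[complete, "Order Total"]
).abs() < 1e-6
```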
This dataset is an initial set of locations identified using the criteria described above. However, these data were subsetted to a smaller set of 20,000 randomly selected locations that did not contain missing values for any of the variables required in the subsequent analysis. The full procedure for this subsetting is contained in the associated Github repository (https://github.com/absimler/nonrandom-seedings).
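A hedged sketch of that subsetting step (complete cases only, then a random draw of 20,000 locations); the file and variable names are placeholders, and the authoritative procedure is the one in the linked GitHub repository.

```python
import pandas as pd

analysis_vars = ["elevation", "slope", "precipitation"]   # placeholder variable names
locations = pd.read_csv("candidate_locations.csv")        # placeholder file name

subset = (locations
          .dropna(subset=analysis_vars)                   # drop rows missing any analysis variable
          .sample(n=20_000, random_state=42))             # random draw of 20,000 locations
```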
https://spdx.org/licenses/CC0-1.0.html
Ecologists use classifications of individuals in categories to understand composition of populations and communities. These categories might be defined by demographics, functional traits, or species. Assignment of categories is often imperfect, but frequently treated as observations without error. When individuals are observed but not classified, these “partial” observations must be modified to include the missing data mechanism to avoid spurious inference.
We developed two hierarchical Bayesian models to overcome the assumption of perfect assignment to mutually exclusive categories in the multinomial distribution of categorical counts when classifications are missing. These models incorporate auxiliary information to adjust the posterior distributions of the proportions of membership in categories. In one model, we use an empirical Bayes approach, where a subset of data from one year serves as a prior for the missing data in the next year. In the other approach, we use a small random sample of data within a year to inform the distribution of the missing data.
We performed a simulation to show the bias that occurs when partial observations were ignored and demonstrated the altered inference for the estimation of demographic ratios. We applied our models to demographic classifications of elk (Cervus elaphus nelsoni) to demonstrate improved inference for the proportions of sex and stage classes.
We developed multiple modeling approaches using a generalizable nested multinomial structure to account for partially observed data that were missing not at random for classification counts. Accounting for classification uncertainty is important to accurately understand the composition of populations and communities in ecological studies.
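As a small illustration of the bias described above (not the authors' hierarchical Bayesian models), the numpy simulation below shows how ignoring unclassified individuals distorts a sex ratio when classification is missing not at random; the classification probabilities are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5000
true_p_female = 0.6
female = rng.random(n) < true_p_female        # true (latent) class of each animal
p_classified = np.where(female, 0.5, 0.9)     # females are classified less often (MNAR)
classified = rng.random(n) < p_classified

naive = female[classified].mean()             # estimate that ignores partial observations
print(f"true proportion female: {true_p_female:.2f}, naive estimate: {naive:.2f}")
```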
1. The analysis of morphological diversity frequently relies on the use of multivariate methods for characterizing biological shape. However, many of these methods are intolerant of missing data, which can limit the use of rare taxa and hinder the study of broad patterns of ecological diversity and morphological evolution. This study applied a multi-dataset approach to compare variation in missing data estimation and its effect on geometric morphometric analysis across taxonomically variable groups, landmark positions and sample sizes.
2. Missing morphometric landmark data were simulated from five real, complete datasets, including modern fish, primates and extinct theropod dinosaurs. Missing landmarks were then estimated using several standard approaches and a geometric-morphometric-specific method. The accuracy of missing data estimation was determined for each estimation method, landmark position, and morphological dataset. Procrustes superimposition was used to compare the eigenvectors and principal component scores of a geometric morphometric analysis of the original landmark data to datasets with (A) missing values estimated, or (B) simulated incomplete specimens excluded, for varying levels of specimen incompleteness and sample sizes.
3. Standard estimation techniques were more reliable estimators and had lower impacts on morphometric analysis than a geometric-morphometric-specific estimator. For most datasets and estimation techniques, estimating missing data produced a better fit to the structure of the original data than excluding incomplete specimens, and this was maintained even at considerably reduced sample sizes. The impact of missing data on geometric morphometric analysis was disproportionately affected by the most fragmentary specimens.
4. Missing data estimation was influenced by the variability of specific anatomical features and may be improved by a better understanding of the shape variation present in a dataset. Our results suggest that including incomplete specimens through the use of effective missing data estimators better reflects the patterns of shape variation within a dataset than using only complete specimens; however, the effectiveness of missing data estimation can be maximized by excluding only the most incomplete specimens. It is advised that missing data estimators be evaluated for each dataset and landmark independently, as the effectiveness of estimators can vary strongly and unpredictably between different taxa and structures.
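As a concrete example of the simplest of the standard estimators compared above, here is a numpy sketch of per-landmark mean substitution on a specimens x landmarks x dimensions array; in practice this would be applied to aligned coordinates and benchmarked against regression-based and geometric-morphometric-specific estimators, as in the study.

```python
import numpy as np

def mean_substitute_landmarks(coords):
    """Estimate missing landmark coordinates with the per-landmark mean across
    the specimens in which that landmark was observed."""
    coords = np.array(coords, dtype=float)    # specimens x landmarks x dims, NaN = missing
    mean_shape = np.nanmean(coords, axis=0)   # consensus computed from observed values only
    filled = coords.copy()
    gaps = np.isnan(filled)
    filled[gaps] = np.broadcast_to(mean_shape, coords.shape)[gaps]
    return filled
```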
1,110 respondents from CH, DE, FR and UK; individual-level data; recruited via Facebook and surveyed with LimeSurvey; timeframe of data collection: July to December 2020. Individuals aged 18+ were targeted (and additionally filtered by a question on voting rights). Questions cover democratic ideals (Likert scales and 100-point scales) and socio-demographics. This is a pre-study for question testing and is not representative. Entries without data were deleted. Dummy variables at the end are only for countries (based on the variable "country"). No missing-values treatment was applied.