Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from the Santa Lucía Chico River (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.
To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms span both univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-Nearest Neighbors Regressor (KNNR).
IDW outperformed the others, achieving very good performance (NSE greater than 0.8) in most cases.
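As a rough illustration of how such a comparison could be set up (not the authors' exact pipeline), the sketch below fills one station's gaps from neighbouring stations with two of the listed scikit-learn regressors and scores the result with the Nash-Sutcliffe efficiency; the file layout and station names are assumptions, and the paper's best-performing method (IDW) is omitted because it also needs station coordinates.

```python
# Minimal sketch, assuming a CSV with one column per station; names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 is no better than the mean."""
    obs, sim = np.asarray(obs), np.asarray(sim)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

df = pd.read_csv("water_quality.csv")                 # hypothetical file layout
target, predictors = "S1", ["S2", "S3"]

known = df.dropna(subset=[target] + predictors)       # rows usable for fitting
to_fill = df[df[target].isna()].dropna(subset=predictors)
X_tr, X_te, y_tr, y_te = train_test_split(known[predictors], known[target],
                                          test_size=0.3, random_state=0)

best_model, best_score = None, -np.inf
for model in (RandomForestRegressor(n_estimators=200, random_state=0), BayesianRidge()):
    model.fit(X_tr, y_tr)
    score = nse(y_te, model.predict(X_te))
    print(type(model).__name__, "hold-out NSE:", round(score, 3))
    if score > best_score:
        best_model, best_score = model, score

# Fill the gaps in the target station with the best-scoring model.
df.loc[to_fill.index, target] = best_model.predict(to_fill[predictors])
```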
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
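As an example of this convention, a column label can be split into its parts as in the sketch below; the sample label, station code, and unit are made up for illustration.

```python
import re

# Hypothetical column label following the "[STATION] FULL NAME (SHORT NAME) [UNIT]" pattern.
label = "[SLC01] Water temperature (Tw) [degC]"
pattern = r"\[(?P<station>[^\]]+)\]\s*(?P<name>.+?)\s*\((?P<short>[^)]+)\)\s*\[(?P<unit>[^\]]+)\]"
print(re.match(pattern, label).groupdict())
# {'station': 'SLC01', 'name': 'Water temperature', 'short': 'Tw', 'unit': 'degC'}
```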
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overcoming bias due to confounding and missing data is challenging when analysing observational data. Propensity scores are commonly used to account for the former and multiple imputation for the latter. Unfortunately, it is not known how best to proceed when both techniques are required. We investigate whether two different approaches to combining propensity scores and multiple imputation (Across and Within) lead to differences in the accuracy or precision of exposure effect estimates. Both approaches start by imputing missing values multiple times. Propensity scores are then estimated for each resulting dataset. Using the Across approach, the mean propensity score across imputations for each subject is used in a single subsequent analysis. Alternatively, the Within approach uses propensity scores individually to obtain exposure effect estimates in each imputation, which are combined to produce an overall estimate. These approaches were compared in a series of Monte Carlo simulations and applied to data from the British Society for Rheumatology Biologics Register. Results indicated that the Within approach produced unbiased estimates with appropriate confidence intervals, whereas the Across approach produced biased results and unrealistic confidence intervals. Researchers are encouraged to implement the Within approach when conducting propensity score analyses with incomplete data.
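A hedged, self-contained sketch of the two strategies is below; the synthetic data and the use of inverse-probability weighting are my own illustrative choices, not the paper's estimator or simulation design.

```python
# "Within" vs "Across" combinations of multiple imputation and propensity scores.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 2000
age = rng.normal(50, 10, n)
severity = rng.normal(0, 1, n)
exposed = rng.binomial(1, 1 / (1 + np.exp(-(0.02 * (age - 50) + 0.5 * severity))))
outcome = 1.0 * exposed + 0.03 * age + 0.8 * severity + rng.normal(0, 1, n)
raw = pd.DataFrame({"age": age, "severity": severity,
                    "exposed": exposed, "outcome": outcome})
raw.loc[rng.random(n) < 0.3, "severity"] = np.nan      # 30% missing confounder

def impute(df, seed):
    """One stochastic imputation draw (chained equations, MICE-style)."""
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    return pd.DataFrame(imp.fit_transform(df), columns=df.columns)

def propensity(df):
    """P(exposed | age, severity) from a logistic model."""
    X = sm.add_constant(df[["age", "severity"]])
    return sm.Logit(df["exposed"], X).fit(disp=0).predict(X)

def iptw_effect(df, ps):
    """Exposure effect on the outcome, weighted by inverse propensity."""
    w = np.where(df["exposed"] == 1, 1 / ps, 1 / (1 - ps))
    fit = sm.WLS(df["outcome"], sm.add_constant(df[["exposed"]]), weights=w).fit()
    return fit.params["exposed"], fit.bse["exposed"]

imputed = [impute(raw, seed) for seed in range(20)]

# "Within": analyse each imputed dataset, then pool with Rubin's rules.
ests, ses = zip(*(iptw_effect(d, propensity(d)) for d in imputed))
within_est = np.mean(ests)
within_var = np.mean(np.square(ses)) + (1 + 1 / len(ests)) * np.var(ests, ddof=1)
print("Within:", round(within_est, 3), "+/-", round(1.96 * np.sqrt(within_var), 3))

# "Across": average each subject's propensity score over the imputations,
# then run a single weighted analysis.
mean_ps = np.mean([propensity(d) for d in imputed], axis=0)
across_est, across_se = iptw_effect(imputed[0], mean_ps)
print("Across:", round(across_est, 3), "+/-", round(1.96 * across_se, 3))
```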
1. The analysis of morphological diversity frequently relies on the use of multivariate methods for characterizing biological shape. However, many of these methods are intolerant of missing data, which can limit the use of rare taxa and hinder the study of broad patterns of ecological diversity and morphological evolution. This study applied a multi-dataset approach to compare variation in missing data estimation and its effect on geometric morphometric analysis across taxonomically variable groups, landmark positions and sample sizes.
2. Missing morphometric landmark data were simulated from five real, complete datasets, including modern fish, primates and extinct theropod dinosaurs. Missing landmarks were then estimated using several standard approaches and a geometric-morphometric-specific method. The accuracy of missing data estimation was determined for each estimation method, landmark position, and morphological dataset. Procrustes superimposition was used to compare the eigenvectors and principal component scores of a geometric morphometric analysis of the original landmark data to datasets with (A) missing values estimated or (B) simulated incomplete specimens excluded, for varying levels of specimen incompleteness and sample sizes.
3. Standard estimation techniques were more reliable estimators and had lower impacts on morphometric analysis than a geometric-morphometric-specific estimator. For most datasets and estimation techniques, estimating missing data produced a better fit to the structure of the original data than exclusion of incomplete specimens, and this was maintained even at considerably reduced sample sizes. The impact of missing data on geometric morphometric analysis was disproportionately driven by the most fragmentary specimens.
4. Missing data estimation was influenced by the variability of specific anatomical features and may be improved by a better understanding of the shape variation present in a dataset. Our results suggest that including incomplete specimens through the use of effective missing data estimators better reflects the patterns of shape variation within a dataset than using only complete specimens; however, the effectiveness of missing data estimation can be maximized by excluding only the most incomplete specimens. It is advised that missing data estimators be evaluated for each dataset and landmark independently, as the effectiveness of estimators can vary strongly and unpredictably between different taxa and structures.
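A minimal sketch of one such standard estimator follows: predicting a fragmentary specimen's missing landmark from its observed landmarks by regression on the complete specimens. The array shapes and random data are illustrative, and a real analysis would superimpose the specimens first.

```python
# Regression-based estimation of a missing 2D landmark (toy data, not the study's datasets).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
coords = rng.normal(size=(40, 10, 2))          # 40 specimens x 10 landmarks x 2D
flat = coords.reshape(40, -1)                  # one row of x,y pairs per specimen

missing_lm = 3                                 # landmark lost in one specimen
target_cols = [2 * missing_lm, 2 * missing_lm + 1]
other_cols = [c for c in range(flat.shape[1]) if c not in target_cols]

complete = flat[:-1]                           # specimens with all landmarks
incomplete = flat[-1:]                         # one fragmentary specimen

model = LinearRegression().fit(complete[:, other_cols], complete[:, target_cols])
estimated_xy = model.predict(incomplete[:, other_cols])
print("estimated landmark", missing_lm, ":", estimated_xy.round(3))
```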
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19 Reported Patient Impact and Hospital Capacity by State’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/66a46309-d465-47bc-9997-210532ebbf63 on 11 February 2022.
--- Dataset description provided by original source is as follows ---
The following dataset provides state-aggregated data for hospital utilization. These are derived from reports with facility-level granularity across two main sources: (1) HHS TeleTracking, and (2) reporting provided directly to HHS Protect by state/territorial health departments on behalf of their healthcare facilities.
The file will be updated daily and provides the latest values reported by each facility within the last four days. This allows for a more comprehensive picture of hospital utilization within a state by ensuring that a hospital is represented even if it misses a single day of reporting.
No statistical analysis is applied to account for non-response and/or to account for missing data.
The table below displays one value for each field (i.e., column). Sometimes, reports for a given facility are provided to both HHS TeleTracking and HHS Protect. When this occurs, deduplication is applied to ensure there are no duplicate reports: specifically, HHS keeps the TeleTracking record provided directly by the facility over the data provided to HHS Protect by the state.
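A minimal pandas sketch of those two rules is below, using hypothetical column names (facility_id, report_date, source, state); it is not the HHS pipeline itself.

```python
# Keep each facility's most recent report from the last four days and,
# when both sources report, prefer the TeleTracking record.
import pandas as pd

reports = pd.read_csv("facility_reports.csv", parse_dates=["report_date"])

cutoff = reports["report_date"].max() - pd.Timedelta(days=4)
recent = reports[reports["report_date"] >= cutoff]

# Rank sources so that "TeleTracking" sorts ahead of "HHS Protect".
priority = {"TeleTracking": 0, "HHS Protect": 1}
recent = recent.assign(src_rank=recent["source"].map(priority))

deduped = (recent.sort_values(["facility_id", "report_date", "src_rank"],
                              ascending=[True, False, True])
                 .drop_duplicates(subset="facility_id", keep="first"))

state_totals = deduped.groupby("state").sum(numeric_only=True)   # state-level aggregation
```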
On April 29, 2021, the following fields were added to this data set:
previous_day_admission_adult_covid_confirmed_18-19
previous_day_admission_adult_covid_confirmed_18-19_coverage
previous_day_admission_adult_covid_confirmed_20-29
previous_day_admission_adult_covid_confirmed_20-29_coverage
previous_day_admission_adult_covid_confirmed_30-39
previous_day_admission_adult_covid_confirmed_30-39_coverage
previous_day_admission_adult_covid_confirmed_40-49
previous_day_admission_adult_covid_confirmed_40-49_coverage
previous_day_admission_adult_covid_confirmed_50-59
previous_day_admission_adult_covid_confirmed_50-59_coverage
previous_day_admission_adult_covid_confirmed_60-69
previous_day_admission_adult_covid_confirmed_60-69_coverage
previous_day_admission_adult_covid_confirmed_70-79
previous_day_admission_adult_covid_confirmed_70-79_coverage
previous_day_admission_adult_covid_confirmed_80+
previous_day_admission_adult_covid_confirmed_80+_coverage
previous_day_admission_adult_covid_confirmed_unknown
previous_day_admission_adult_covid_confirmed_unknown_coverage
previous_day_admission_adult_covid_suspected_18-19
previous_day_admission_adult_covid_suspected_18-19_coverage
previous_day_admission_adult_covid_suspected_20-29
previous_day_admission_adult_covid_suspected_20-29_coverage
previous_day_admission_adult_covid_suspected_30-39
previous_day_admission_adult_covid_suspected_30-39_coverage
previous_day_admission_adult_covid_suspected_40-49
previous_day_admission_adult_covid_suspected_40-49_coverage
previous_day_admission_adult_covid_suspected_50-59
previous_day_admission_adult_covid_suspected_50-59_coverage
previous_day_admission_adult_covid_suspected_60-69
previous_day_admission_adult_covid_suspected_60-69_coverage
previous_day_admission_adult_covid_suspected_70-79
previous_day_admission_adult_covid_suspected_70-79_coverage
previous_day_admission_adult_covid_suspected_80+
previous_day_admission_adult_covid_suspected_80+_coverage
previous_day_admission_adult_covid_suspected_unknown
previous_day_admission_adult_covid_suspected_unknown_coverage
On June 30, 2021, the following fields were added to this data set:
deaths_covid
deaths_covid_coverage
On September 13, 2021, the following fields were added to this data set:
on_hand_supply_therapeutic_a_casirivimab_imdevimab_courses,
on_hand_supply_therapeutic_b_bamlanivimab_courses,
on_hand_supply_therapeutic_c_bamlanivimab_etesevimab_courses,
previous_week_therapeutic_a_casirivimab_imdevimab_courses_used,
previous_week_therapeutic_b_bamlanivimab_courses_used,
previous_week_therapeutic_c_bamlanivimab_etesevimab_courses_used
On September 17, 2021, the following fields were added to this data set:
icu_patients_confirmed_influenza,
icu_patients_confirmed_influenza_coverage,
previous_day_admission_influenza_confirmed,
previous_day_admission_infl
--- Original source retains full ownership of the source dataset ---
Millennium Challenge Corporation hired Mathematica Policy Research to conduct an independent evaluation of the BRIGHT II program. The three main research questions of interest are:
• What was the impact of the program on school enrollment, attendance, and retention?
• What was the impact of the program on test scores?
• Are the impacts different for girls than for boys?
Mathematica will compare data collected from the 132 communities served by BRIGHT II (the "treatment group") with that collected from the 161 communities that applied but were not selected for the program (the "comparison group"). Using a statistical technique called regression discontinuity, Mathematica will compare the outcomes of the treatment villages just above the cutoff point to the outcomes of the comparison villages just below the cutoff point. If the intervention had an impact, we will observe a "jump" in outcomes at the point of discontinuity.
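A minimal sketch of that comparison is below: a local linear regression-discontinuity fit on a hypothetical eligibility score and enrollment outcome. The cutoff, bandwidth, and column names are assumptions, not Mathematica's specification.

```python
# Regression discontinuity: estimate the jump in the outcome at the eligibility cutoff.
import pandas as pd
import statsmodels.formula.api as smf

villages = pd.read_csv("villages.csv")        # assumed columns: score, enrollment
cutoff, bandwidth = 0.0, 10.0
near = villages[(villages["score"] - cutoff).abs() <= bandwidth].copy()
near["centered"] = near["score"] - cutoff
near["treated"] = (near["centered"] >= 0).astype(int)

# Local linear fit with separate slopes on each side; the coefficient on
# `treated` is the estimated jump in enrollment at the cutoff.
fit = smf.ols("enrollment ~ treated + centered + treated:centered", data=near).fit()
print(fit.params["treated"], fit.conf_int().loc["treated"].tolist())
```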
Mathematica will perform additional analyses to estimate the overall merit of the BRIGHT investment. By conducting a cost-benefit analysis and a cost-effectiveness analysis and calculating the economic rate of return, Mathematica will be able to answer questions related to the sustainability of the program and compare the program to interventions and social investments in other sectors. The household survey is designed to capture household-level data rather than community-level data; however, questions have been included to measure head-of-household expectations of educational attainment. These questions ask the head of household what grade level he hopes each child will attain, and what grade level he thinks the child will be capable of achieving in reality.
132 rural villages throughout the 10 provinces of Burkina Faso in which girls' enrollment rates were lowest
Households
Households, students, and educators in the 287 villages surveyed
Sample survey data [ssd]
The BRIGHT II program was implemented in the same 132 villages that received the BRIGHT I interventions. These 132 villages were originally selected using a scoring process, with eligibility scores based on the villages’ potential to improve girls’ educational outcomes. A total of 293 villages applied to receive a BRIGHT school; the Burkina Faso Ministry of Basic Education (MEBA) selected the 132 villages with scores that were above a certain cutoff point. Whenever possible, the survey will be conducted with the same children in the same households and schools surveyed during the BRIGHT I evaluation. By visiting the same households and schools, the evaluator will be able to better assess the longer-term impacts of the BRIGHT project.
Mathematica has developed two surveys, a household survey and a school survey, to collect relevant data from villages in both the treatment and comparison groups. The household survey was administered to a new cross-section of households compared to the BRIGHT I evaluation. Data will be collected on the attendance and educational attainment of school-age children in the household, attitudes towards girls' education, and parental assessment of the extent to which the complementary interventions influenced school enrollment decisions. It will also assess the performance of all household children on basic tests of French and math. The school survey, to be administered to all local schools in the 293 villages, gathers data on school characteristics, personnel, and physical structure, and collects enrollment and attendance records. Data will be gathered by a local data collection firm selected by MCA-Burkina Faso, with Mathematica providing technical assistance and oversight.
Following data collection, Mathematica will work with BERD to ensure that the data are correctly entered and are complete and clean. This will include a review of all frequencies for out-of-range responses, missing data, or other problems, as well as a comparison between the data and paper copies for a random selection of variables.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw): a typical unprocessed dataset of the kind often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset: demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
- Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
- Outliers: Detected and handled based on domain logic and distribution analysis.
- Categorization: Converted numeric ages into grouped age categories for comparative analysis.
- Standardization: Applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
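A minimal pandas sketch of a few of these steps follows; the file name and column names are assumptions about the raw dataset's layout rather than its documented schema.

```python
# A few of the listed cleaning steps: rename, deduplicate, normalise strings,
# coerce types, drop rows missing critical fields, and bin ages.
import pandas as pd

raw = pd.read_csv("employment_raw.csv")

clean = (raw.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})
            .drop_duplicates())

# Normalise string columns (case, surrounding whitespace).
for col in clean.select_dtypes("object"):
    clean[col] = clean[col].str.strip().str.title()

# Coerce salary to numeric, then drop rows missing critical fields.
clean["Monthly Salary (INR)"] = pd.to_numeric(clean["Monthly Salary (INR)"], errors="coerce")
clean = clean.dropna(subset=["Monthly Salary (INR)", "Employment Status"])

# Group numeric ages into the categories used for comparison.
clean["Age Group"] = pd.cut(clean["Age"], bins=[17, 25, 35, 50, 65],
                            labels=["18-25", "26-35", "36-50", "51-65"])
```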
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:
1- Data Import and Transformation:
2- Data Quality Assessment:
3- Calculating COGS:
4- Discount Analysis:
5- Sales Metrics:
6- Visualization:
7- Report Generation:
Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
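Power Query itself is not scriptable here, so as a rough stand-in the pandas sketch below shows the same kind of sheet join, COGS, and discount-value calculations; the sheet names, column names, and the assumption that Sales is the post-discount amount are guesses rather than the workbook's actual layout.

```python
# Pandas stand-in for the Power Query steps: join sheets, check quality,
# compute COGS and discount value, and summarise sales metrics.
import pandas as pd

orders = pd.read_excel("superstore.xlsx", sheet_name="Orders")
products = pd.read_excel("superstore.xlsx", sheet_name="Products")

# Steps 1-2: connect the sheets and run a quick data-quality check.
sales = orders.merge(products, on="Product ID", how="left", validate="m:1")
print(sales.isna().sum())

# Step 3: COGS = units sold x unit cost (unit cost assumed to live in the Products sheet).
sales["COGS"] = sales["Quantity"] * sales["Unit Cost"]

# Step 4: discount value = list-price revenue minus the discounted sales amount,
# assuming Sales is post-discount and Discount is a fraction of list price.
sales["Discount Value"] = sales["Sales"] / (1 - sales["Discount"]) - sales["Sales"]

# Step 5: headline metrics.
print(sales.agg({"Sales": "sum", "COGS": "sum", "Discount Value": "sum"}))
```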
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Transactional Retail Dataset of Electronics Store’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/muhammadshahrayar/transactional-retail-dataset-of-electronics-store on 14 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains information about an online electronic store. The store has three warehouses from which goods are delivered to customers.
Use this dataset to apply graphical and/or non-graphical EDA methods to understand the data first, and then find and fix the data problems:
- Detect and fix errors in dirty_data.csv
- Impute the missing values in missing_data.csv
- Detect and remove anomalies
- Check whether a customer is happy with their last order
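A minimal sketch of one possible workflow for these tasks is below; the CSV names come from the task description, while the column names (warehouse, order_total) and the specific imputation and anomaly rules are assumptions.

```python
# Quick EDA, a simple group-wise imputation, and a z-score anomaly check.
import pandas as pd

dirty = pd.read_csv("dirty_data.csv")
missing = pd.read_csv("missing_data.csv")

# Non-graphical EDA: structure, dtypes, and summary statistics.
dirty.info()
print(dirty.describe(include="all"))

# Impute missing numeric values with per-warehouse medians (one simple choice).
num_cols = missing.select_dtypes("number").columns
missing[num_cols] = (missing.groupby("warehouse")[num_cols]
                            .transform(lambda s: s.fillna(s.median())))

# Flag anomalies with a simple z-score rule on order totals.
z = (dirty["order_total"] - dirty["order_total"].mean()) / dirty["order_total"].std()
anomalies = dirty[z.abs() > 3]
```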
All the Best
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genome segmentation approaches allow us to characterize regulatory states in a given cell type using combinatorial patterns of histone modifications and other regulatory signals. In order to analyze regulatory state differences across cell types, current genome segmentation approaches typically require that the same regulatory genomics assays have been performed in all analyzed cell types. This necessarily limits both the numbers of cell types that can be analyzed and the complexity of the resulting regulatory states, as only a small number of histone modifications have been profiled across many cell types. Data imputation approaches that aim to estimate missing regulatory signals have been applied before genome segmentation. However, this approach is computationally costly and propagates any errors in imputation to produce incorrect genome segmentation results downstream. We present an extension to the IDEAS genome segmentation platform which can perform genome segmentation on incomplete regulatory genomics dataset collections without using imputation. Instead of relying on imputed data, we use an expectation-maximization approach to estimate marginal density functions within each regulatory state. We demonstrate that our genome segmentation results compare favorably with approaches based on imputation or other strategies for handling missing data. We further show that our approach can accurately impute missing data after genome segmentation, reversing the typical order of imputation/genome segmentation pipelines. Finally, we present a new 2D genome segmentation analysis of 127 human cell types studied by the Roadmap Epigenomics Consortium. By using an expanded set of chromatin marks that have been profiled in subsets of these cell types, our new segmentation results capture a more complex picture of combinatorial regulatory patterns that appear on the human genome.
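A toy illustration of that idea, not the IDEAS implementation: in an EM-style E-step for a diagonal Gaussian mixture, each genomic position can be scored against each regulatory state using the marginal density over only the marks that were actually assayed, so missing marks are simply left out rather than imputed first. The states, marks, and parameter values below are made up.

```python
# E-step responsibilities with NaNs marking marks that were not assayed.
import numpy as np
from scipy.stats import norm

def responsibilities(x, means, sds, weights):
    """Posterior state probabilities for one position, marginalizing missing marks."""
    observed = ~np.isnan(x)
    log_p = np.log(weights).copy()
    for k in range(len(weights)):
        # Marginal log-density: product over observed dimensions only.
        log_p[k] += norm.logpdf(x[observed], means[k, observed], sds[k, observed]).sum()
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

means = np.array([[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]])    # two toy states, three marks
sds = np.ones_like(means)
weights = np.array([0.5, 0.5])
x = np.array([2.8, np.nan, 3.2])                        # one mark not assayed
print(responsibilities(x, means, sds, weights))
```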
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Chronograms from molecular dating are increasingly being used to infer rates of diversification and their change over time. A major limitation in such analyses is incomplete species sampling that moreover is usually non-random. While the widely used γ statistic with the MCCR test or the birth-death likelihood analysis with the ∆AICrc test statistic are appropriate for comparing the fit of different diversification models in phylogenies with random species sampling, no objective, automated method has been developed for fitting diversification models to non-randomly sampled phylogenies. Here we introduce a novel approach, CorSiM, which involves simulating missing splits under a constant-rate birth-death model and allows the user to specify whether species sampling in the phylogeny being analyzed is random or non-random. The completed trees can be used in subsequent model-fitting analyses. This is fundamentally different from previous diversification rate estimation methods, which were based on null distributions derived from the incomplete trees. CorSiM is automated in an R package and can easily be applied to large data sets. We illustrate the approach in two Araceae clades, one with a random species sampling of 52% and one with a non-random sampling of 55%. In the latter clade, the CorSiM approach detects and quantifies an increase in diversification rate while classic approaches prefer a constant rate model, whereas in the former clade, results do not differ among methods (as indeed expected since the classic approaches are valid only for randomly sampled phylogenies). The CorSiM method greatly reduces the type I error in diversification analysis, but type II error remains a methodological problem.
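A much-simplified toy, not the CorSiM algorithm (which is an R package and conditions the simulated splits on the observed tree and sampling scheme): it only shows how split times can be simulated forward under a constant-rate birth-death model and appended to observed split times to form a "completed" set for downstream model fitting. All rates and times are arbitrary.

```python
# Forward simulation of split times under a constant-rate birth-death process.
import numpy as np

rng = np.random.default_rng(2)

def birth_death_split_times(birth=1.0, death=0.3, t_max=5.0):
    """Times of speciation events in one forward birth-death simulation."""
    t, lineages, splits = 0.0, 1, []
    while lineages > 0:
        rate = lineages * (birth + death)
        t += rng.exponential(1.0 / rate)
        if t > t_max:
            break
        if rng.random() < birth / (birth + death):
            lineages += 1
            splits.append(t)          # a speciation (split) event
        else:
            lineages -= 1             # an extinction event
    return splits

# One way a "completed" set of splits could be assembled: observed split times
# from a chronogram plus simulated times standing in for the unsampled splits.
observed_splits = [0.8, 1.9, 3.1]                       # hypothetical chronogram ages
simulated_missing = birth_death_split_times()[:2]       # pretend two splits were missed
completed = sorted(observed_splits + simulated_missing)
print(completed)
```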