Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.
This work focuses on surface-water-quality data from the Santa Lucía Chico River (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables and the high rate of missing values (between 50% and 70%) raise significant challenges.
To deal with missing values, we applied several statistical and machine-learning imputation methods. The competing algorithms span both univariate and multivariate imputation methods: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-Nearest Neighbors Regressor (KNNR).
IDW outperformed the others, achieving very good performance (NSE greater than 0.8) in most cases.
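As a rough illustration of how such a comparison could be set up (not the authors' exact pipeline), the sketch below fills one station's gaps from neighbouring stations with two of the listed scikit-learn regressors and scores the result with the Nash-Sutcliffe efficiency; the file layout and station names are assumptions, and the paper's best-performing method (IDW) is omitted because it also needs station coordinates.

```python
# Minimal sketch, assuming a CSV with one column per station; names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 is no better than the mean."""
    obs, sim = np.asarray(obs), np.asarray(sim)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

df = pd.read_csv("water_quality.csv")                 # hypothetical file layout
target, predictors = "S1", ["S2", "S3"]

known = df.dropna(subset=[target] + predictors)       # rows usable for fitting
to_fill = df[df[target].isna()].dropna(subset=predictors)
X_tr, X_te, y_tr, y_te = train_test_split(known[predictors], known[target],
                                          test_size=0.3, random_state=0)

best_model, best_score = None, -np.inf
for model in (RandomForestRegressor(n_estimators=200, random_state=0), BayesianRidge()):
    model.fit(X_tr, y_tr)
    score = nse(y_te, model.predict(X_te))
    print(type(model).__name__, "hold-out NSE:", round(score, 3))
    if score > best_score:
        best_model, best_score = model, score

# Fill the gaps in the target station with the best-scoring model.
df.loc[to_fill.index, target] = best_model.predict(to_fill[predictors])
```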
In this dataset, we include the original and imputed values for the following variables:
Water temperature (Tw)
Dissolved oxygen (DO)
Electrical conductivity (EC)
pH
Turbidity (Turb)
Nitrite (NO2-)
Nitrate (NO3-)
Total Nitrogen (TN)
Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].
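As an example of this convention, a column label can be split into its parts as in the sketch below; the sample label, station code, and unit are made up for illustration.

```python
import re

# Hypothetical column label following the "[STATION] FULL NAME (SHORT NAME) [UNIT]" pattern.
label = "[SLC01] Water temperature (Tw) [degC]"
pattern = r"\[(?P<station>[^\]]+)\]\s*(?P<name>.+?)\s*\((?P<short>[^)]+)\)\s*\[(?P<unit>[^\]]+)\]"
print(re.match(pattern, label).groupdict())
# {'station': 'SLC01', 'name': 'Water temperature', 'short': 'Tw', 'unit': 'degC'}
```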
More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.
If you use this dataset in your work, please cite our paper:
Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overcoming bias due to confounding and missing data is challenging when analysing observational data. Propensity scores are commonly used to account for the former and multiple imputation for the latter. Unfortunately, it is not known how best to proceed when both techniques are required. We investigate whether two different approaches to combining propensity scores and multiple imputation (Across and Within) lead to differences in the accuracy or precision of exposure effect estimates. Both approaches start by imputing missing values multiple times. Propensity scores are then estimated for each resulting dataset. Using the Across approach, the mean propensity score across imputations for each subject is used in a single subsequent analysis. Alternatively, the Within approach uses propensity scores individually to obtain exposure effect estimates in each imputation, which are combined to produce an overall estimate. These approaches were compared in a series of Monte Carlo simulations and applied to data from the British Society for Rheumatology Biologics Register. Results indicated that the Within approach produced unbiased estimates with appropriate confidence intervals, whereas the Across approach produced biased results and unrealistic confidence intervals. Researchers are encouraged to implement the Within approach when conducting propensity score analyses with incomplete data.
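A hedged, self-contained sketch of the two strategies is below; the synthetic data and the use of inverse-probability weighting are my own illustrative choices, not the paper's estimator or simulation design.

```python
# "Within" vs "Across" combinations of multiple imputation and propensity scores.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 2000
age = rng.normal(50, 10, n)
severity = rng.normal(0, 1, n)
exposed = rng.binomial(1, 1 / (1 + np.exp(-(0.02 * (age - 50) + 0.5 * severity))))
outcome = 1.0 * exposed + 0.03 * age + 0.8 * severity + rng.normal(0, 1, n)
raw = pd.DataFrame({"age": age, "severity": severity,
                    "exposed": exposed, "outcome": outcome})
raw.loc[rng.random(n) < 0.3, "severity"] = np.nan      # 30% missing confounder

def impute(df, seed):
    """One stochastic imputation draw (chained equations, MICE-style)."""
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    return pd.DataFrame(imp.fit_transform(df), columns=df.columns)

def propensity(df):
    """P(exposed | age, severity) from a logistic model."""
    X = sm.add_constant(df[["age", "severity"]])
    return sm.Logit(df["exposed"], X).fit(disp=0).predict(X)

def iptw_effect(df, ps):
    """Exposure effect on the outcome, weighted by inverse propensity."""
    w = np.where(df["exposed"] == 1, 1 / ps, 1 / (1 - ps))
    fit = sm.WLS(df["outcome"], sm.add_constant(df[["exposed"]]), weights=w).fit()
    return fit.params["exposed"], fit.bse["exposed"]

imputed = [impute(raw, seed) for seed in range(20)]

# "Within": analyse each imputed dataset, then pool with Rubin's rules.
ests, ses = zip(*(iptw_effect(d, propensity(d)) for d in imputed))
within_est = np.mean(ests)
within_var = np.mean(np.square(ses)) + (1 + 1 / len(ests)) * np.var(ests, ddof=1)
print("Within:", round(within_est, 3), "+/-", round(1.96 * np.sqrt(within_var), 3))

# "Across": average each subject's propensity score over the imputations,
# then run a single weighted analysis.
mean_ps = np.mean([propensity(d) for d in imputed], axis=0)
across_est, across_se = iptw_effect(imputed[0], mean_ps)
print("Across:", round(across_est, 3), "+/-", round(1.96 * across_se, 3))
```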
1. The analysis of morphological diversity frequently relies on the use of multivariate methods for characterizing biological shape. However, many of these methods are intolerant of missing data, which can limit the use of rare taxa and hinder the study of broad patterns of ecological diversity and morphological evolution. This study applied a multi-dataset approach to compare variation in missing data estimation and its effect on geometric morphometric analysis across taxonomically variable groups, landmark positions and sample sizes.
2. Missing morphometric landmark data were simulated from five real, complete datasets, including modern fish, primates and extinct theropod dinosaurs. Missing landmarks were then estimated using several standard approaches and a geometric-morphometric-specific method. The accuracy of missing data estimation was determined for each estimation method, landmark position, and morphological dataset. Procrustes superimposition was used to compare the eigenvectors and principal component scores of a geometric morphometric analysis of the original landmark data to datasets with (A) missing values estimated or (B) simulated incomplete specimens excluded, for varying levels of specimen incompleteness and sample sizes.
3. Standard estimation techniques were more reliable estimators and had lower impacts on morphometric analysis than a geometric-morphometric-specific estimator. For most datasets and estimation techniques, estimating missing data produced a better fit to the structure of the original data than exclusion of incomplete specimens, and this was maintained even at considerably reduced sample sizes. The impact of missing data on geometric morphometric analysis was disproportionately driven by the most fragmentary specimens.
4. Missing data estimation was influenced by the variability of specific anatomical features and may be improved by a better understanding of the shape variation present in a dataset. Our results suggest that including incomplete specimens through the use of effective missing data estimators better reflects the patterns of shape variation within a dataset than using only complete specimens; however, the effectiveness of missing data estimation can be maximized by excluding only the most incomplete specimens. It is advised that missing data estimators be evaluated for each dataset and landmark independently, as the effectiveness of estimators can vary strongly and unpredictably between different taxa and structures.
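A minimal sketch of one such standard estimator follows: predicting a fragmentary specimen's missing landmark from its observed landmarks by regression on the complete specimens. The array shapes and random data are illustrative, and a real analysis would superimpose the specimens first.

```python
# Regression-based estimation of a missing 2D landmark (toy data, not the study's datasets).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
coords = rng.normal(size=(40, 10, 2))          # 40 specimens x 10 landmarks x 2D
flat = coords.reshape(40, -1)                  # one row of x,y pairs per specimen

missing_lm = 3                                 # landmark lost in one specimen
target_cols = [2 * missing_lm, 2 * missing_lm + 1]
other_cols = [c for c in range(flat.shape[1]) if c not in target_cols]

complete = flat[:-1]                           # specimens with all landmarks
incomplete = flat[-1:]                         # one fragmentary specimen

model = LinearRegression().fit(complete[:, other_cols], complete[:, target_cols])
estimated_xy = model.predict(incomplete[:, other_cols])
print("estimated landmark", missing_lm, ":", estimated_xy.round(3))
```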
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19 Reported Patient Impact and Hospital Capacity by State’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/66a46309-d465-47bc-9997-210532ebbf63 on 11 February 2022.
--- Dataset description provided by original source is as follows ---
The following dataset provides state-aggregated data for hospital utilization. These are derived from reports with facility-level granularity across two main sources: (1) HHS TeleTracking, and (2) reporting provided directly to HHS Protect by state/territorial health departments on behalf of their healthcare facilities.
The file will be updated daily and provides the latest values reported by each facility within the last four days. This allows for a more comprehensive picture of hospital utilization within a state by ensuring that a hospital is represented even if it misses a single day of reporting.
No statistical analysis is applied to account for non-response and/or to account for missing data.
The table below displays one value for each field (i.e., column). Sometimes, reports for a given facility are provided to both HHS TeleTracking and HHS Protect. When this occurs, deduplication is applied to ensure there are no duplicate reports: specifically, HHS keeps the TeleTracking record provided directly by the facility over the data provided to HHS Protect by the state.
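A minimal pandas sketch of those two rules is below, using hypothetical column names (facility_id, report_date, source, state); it is not the HHS pipeline itself.

```python
# Keep each facility's most recent report from the last four days and,
# when both sources report, prefer the TeleTracking record.
import pandas as pd

reports = pd.read_csv("facility_reports.csv", parse_dates=["report_date"])

cutoff = reports["report_date"].max() - pd.Timedelta(days=4)
recent = reports[reports["report_date"] >= cutoff]

# Rank sources so that "TeleTracking" sorts ahead of "HHS Protect".
priority = {"TeleTracking": 0, "HHS Protect": 1}
recent = recent.assign(src_rank=recent["source"].map(priority))

deduped = (recent.sort_values(["facility_id", "report_date", "src_rank"],
                              ascending=[True, False, True])
                 .drop_duplicates(subset="facility_id", keep="first"))

state_totals = deduped.groupby("state").sum(numeric_only=True)   # state-level aggregation
```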
On April 29, 2021, the following fields were added to this data set:
previous_day_admission_adult_covid_confirmed_18-19
previous_day_admission_adult_covid_confirmed_18-19_coverage
previous_day_admission_adult_covid_confirmed_20-29
previous_day_admission_adult_covid_confirmed_20-29_coverage
previous_day_admission_adult_covid_confirmed_30-39
previous_day_admission_adult_covid_confirmed_30-39_coverage
previous_day_admission_adult_covid_confirmed_40-49
previous_day_admission_adult_covid_confirmed_40-49_coverage
previous_day_admission_adult_covid_confirmed_50-59
previous_day_admission_adult_covid_confirmed_50-59_coverage
previous_day_admission_adult_covid_confirmed_60-69
previous_day_admission_adult_covid_confirmed_60-69_coverage
previous_day_admission_adult_covid_confirmed_70-79
previous_day_admission_adult_covid_confirmed_70-79_coverage
previous_day_admission_adult_covid_confirmed_80+
previous_day_admission_adult_covid_confirmed_80+_coverage
previous_day_admission_adult_covid_confirmed_unknown
previous_day_admission_adult_covid_confirmed_unknown_coverage
previous_day_admission_adult_covid_suspected_18-19
previous_day_admission_adult_covid_suspected_18-19_coverage
previous_day_admission_adult_covid_suspected_20-29
previous_day_admission_adult_covid_suspected_20-29_coverage
previous_day_admission_adult_covid_suspected_30-39
previous_day_admission_adult_covid_suspected_30-39_coverage
previous_day_admission_adult_covid_suspected_40-49
previous_day_admission_adult_covid_suspected_40-49_coverage
previous_day_admission_adult_covid_suspected_50-59
previous_day_admission_adult_covid_suspected_50-59_coverage
previous_day_admission_adult_covid_suspected_60-69
previous_day_admission_adult_covid_suspected_60-69_coverage
previous_day_admission_adult_covid_suspected_70-79
previous_day_admission_adult_covid_suspected_70-79_coverage
previous_day_admission_adult_covid_suspected_80+
previous_day_admission_adult_covid_suspected_80+_coverage
previous_day_admission_adult_covid_suspected_unknown
previous_day_admission_adult_covid_suspected_unknown_coverage
On June 30, 2021, the following fields were added to this data set:
deaths_covid
deaths_covid_coverage
On September 13, 2021, the following fields were added to this data set:
on_hand_supply_therapeutic_a_casirivimab_imdevimab_courses,
on_hand_supply_therapeutic_b_bamlanivimab_courses,
on_hand_supply_therapeutic_c_bamlanivimab_etesevimab_courses,
previous_week_therapeutic_a_casirivimab_imdevimab_courses_used,
previous_week_therapeutic_b_bamlanivimab_courses_used,
previous_week_therapeutic_c_bamlanivimab_etesevimab_courses_used
On September 17, 2021, the following fields were added to this data set:
icu_patients_confirmed_influenza,
icu_patients_confirmed_influenza_coverage,
previous_day_admission_influenza_confirmed,
previous_day_admission_infl
--- Original source retains full ownership of the source dataset ---
Millennium Challenge Corporation hired Mathematica Policy Research to conduct an independent evaluation of the BRIGHT II program. The three main research questions of interest are:
• What was the impact of the program on school enrollment, attendance, and retention?
• What was the impact of the program on test scores?
• Are the impacts different for girls than for boys?
Mathematica will compare data collected from the 132 communities served by BRIGHT II (the "treatment group") with that collected from the 161 communities that applied but were not selected for the program (the "comparison group"). Using a statistical technique called regression discontinuity, Mathematica will compare the outcomes of the treatment villages just above the cutoff point to the outcomes of the comparison villages just below the cutoff point. If the intervention had an impact, we will observe a "jump" in outcomes at the point of discontinuity.
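A minimal sketch of that comparison is below: a local linear regression-discontinuity fit on a hypothetical eligibility score and enrollment outcome. The cutoff, bandwidth, and column names are assumptions, not Mathematica's specification.

```python
# Regression discontinuity: estimate the jump in the outcome at the eligibility cutoff.
import pandas as pd
import statsmodels.formula.api as smf

villages = pd.read_csv("villages.csv")        # assumed columns: score, enrollment
cutoff, bandwidth = 0.0, 10.0
near = villages[(villages["score"] - cutoff).abs() <= bandwidth].copy()
near["centered"] = near["score"] - cutoff
near["treated"] = (near["centered"] >= 0).astype(int)

# Local linear fit with separate slopes on each side; the coefficient on
# `treated` is the estimated jump in enrollment at the cutoff.
fit = smf.ols("enrollment ~ treated + centered + treated:centered", data=near).fit()
print(fit.params["treated"], fit.conf_int().loc["treated"].tolist())
```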
Mathematica will perform additional analyses to estimate the overall merit of the BRIGHT investment. By conducting a cost-benefit analysis and a cost-effectiveness analysis and calculating the economic rate of return, Mathematica will be able to answer questions related to the sustainability of the program and compare the program to interventions and social investments in other sectors. The household survey is designed to capture household-level data rather than community-level data; however, questions have been included to measure head-of-household expectations of educational attainment. These questions ask the head of household what grade level he hopes each child will attain, and what grade level he thinks the child will be capable of achieving in reality.
132 rural villages throughout the 10 provinces of Burkina Faso in which girls' enrollment rates were lowest
Households
Households, students, and educators in the 287 villages surveyed
Sample survey data [ssd]
The BRIGHT II program was implemented in the same 132 villages that received the BRIGHT I interventions. These 132 villages were originally selected using a scoring process, with eligibility scores based on the villages’ potential to improve girls’ educational outcomes. A total of 293 villages applied to receive a BRIGHT school; the Burkina Faso Ministry of Basic Education (MEBA) selected the 132 villages with scores that were above a certain cutoff point. Whenever possible, the survey will be conducted with the same children in the same households and schools surveyed during the BRIGHT I evaluation. By visiting the same households and schools, the evaluator will be able to better assess the longer-term impacts of the BRIGHT project.
Mathematica has developed two surveys, a household survey and a school survey, to collect relevant data from villages in both the treatment and comparison groups. The household survey was administered to a new cross-section of households compared to the BRIGHT I evaluation. Data will be collected on the attendance and educational attainment of school-age children in the household, attitudes towards girls' education, and parental assessment of the extent to which the complementary interventions influenced school enrollment decisions. It will also assess the performance of all household children on basic tests of French and math. The school survey, to be administered to all local schools in the 293 villages, gathers data on school characteristics, personnel, and physical structure, and collects enrollment and attendance records. Data will be gathered by a local data collection firm selected by MCA-Burkina Faso, with Mathematica providing technical assistance and oversight.
Following data collection, Mathematica will work with BERD to ensure that the data are correctly entered and are complete and clean. This will include a review of all frequencies for out-of-range responses, missing data, or other problems, as well as a comparison between the data and paper copies for a random selection of variables.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw): a typical unprocessed dataset of the kind often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset: demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
- Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
- Outliers: Detected and handled based on domain logic and distribution analysis.
- Categorization: Converted numeric ages into grouped age categories for comparative analysis.
- Standardization: Applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
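A minimal pandas sketch of a few of these steps follows; the file name and column names are assumptions about the raw dataset's layout rather than its documented schema.

```python
# A few of the listed cleaning steps: rename, deduplicate, normalise strings,
# coerce types, drop rows missing critical fields, and bin ages.
import pandas as pd

raw = pd.read_csv("employment_raw.csv")

clean = (raw.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})
            .drop_duplicates())

# Normalise string columns (case, surrounding whitespace).
for col in clean.select_dtypes("object"):
    clean[col] = clean[col].str.strip().str.title()

# Coerce salary to numeric, then drop rows missing critical fields.
clean["Monthly Salary (INR)"] = pd.to_numeric(clean["Monthly Salary (INR)"], errors="coerce")
clean = clean.dropna(subset=["Monthly Salary (INR)", "Employment Status"])

# Group numeric ages into the categories used for comparison.
clean["Age Group"] = pd.cut(clean["Age"], bins=[17, 25, 35, 50, 65],
                            labels=["18-25", "26-35", "36-50", "51-65"])
```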
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:
1- Data Import and Transformation:
2- Data Quality Assessment:
3- Calculating COGS:
4- Discount Analysis:
5- Sales Metrics:
6- Visualization:
7- Report Generation:
Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
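Power Query itself is not scriptable here, so as a rough stand-in the pandas sketch below shows the same kind of sheet join, COGS, and discount-value calculations; the sheet names, column names, and the assumption that Sales is the post-discount amount are guesses rather than the workbook's actual layout.

```python
# Pandas stand-in for the Power Query steps: join sheets, check quality,
# compute COGS and discount value, and summarise sales metrics.
import pandas as pd

orders = pd.read_excel("superstore.xlsx", sheet_name="Orders")
products = pd.read_excel("superstore.xlsx", sheet_name="Products")

# Steps 1-2: connect the sheets and run a quick data-quality check.
sales = orders.merge(products, on="Product ID", how="left", validate="m:1")
print(sales.isna().sum())

# Step 3: COGS = units sold x unit cost (unit cost assumed to live in the Products sheet).
sales["COGS"] = sales["Quantity"] * sales["Unit Cost"]

# Step 4: discount value = list-price revenue minus the discounted sales amount,
# assuming Sales is post-discount and Discount is a fraction of list price.
sales["Discount Value"] = sales["Sales"] / (1 - sales["Discount"]) - sales["Sales"]

# Step 5: headline metrics.
print(sales.agg({"Sales": "sum", "COGS": "sum", "Discount Value": "sum"}))
```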
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Transactional Retail Dataset of Electronics Store’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/muhammadshahrayar/transactional-retail-dataset-of-electronics-store on 14 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains information about an online electronic store. The store has three warehouses from which goods are delivered to customers.
Use this dataset to apply graphical and/or non-graphical EDA methods to understand the data first, and then find and fix the data problems:
- Detect and fix errors in dirty_data.csv
- Impute the missing values in missing_data.csv
- Detect and remove anomalies
- Check whether a customer is happy with their last order
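A minimal sketch of one possible workflow for these tasks is below; the CSV names come from the task description, while the column names (warehouse, order_total) and the specific imputation and anomaly rules are assumptions.

```python
# Quick EDA, a simple group-wise imputation, and a z-score anomaly check.
import pandas as pd

dirty = pd.read_csv("dirty_data.csv")
missing = pd.read_csv("missing_data.csv")

# Non-graphical EDA: structure, dtypes, and summary statistics.
dirty.info()
print(dirty.describe(include="all"))

# Impute missing numeric values with per-warehouse medians (one simple choice).
num_cols = missing.select_dtypes("number").columns
missing[num_cols] = (missing.groupby("warehouse")[num_cols]
                            .transform(lambda s: s.fillna(s.median())))

# Flag anomalies with a simple z-score rule on order totals.
z = (dirty["order_total"] - dirty["order_total"].mean()) / dirty["order_total"].std()
anomalies = dirty[z.abs() > 3]
```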
All the Best
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Genome segmentation approaches allow us to characterize regulatory states in a given cell type using combinatorial patterns of histone modifications and other regulatory signals. In order to analyze regulatory state differences across cell types, current genome segmentation approaches typically require that the same regulatory genomics assays have been performed in all analyzed cell types. This necessarily limits both the numbers of cell types that can be analyzed and the complexity of the resulting regulatory states, as only a small number of histone modifications have been profiled across many cell types. Data imputation approaches that aim to estimate missing regulatory signals have been applied before genome segmentation. However, this approach is computationally costly and propagates any errors in imputation to produce incorrect genome segmentation results downstream. We present an extension to the IDEAS genome segmentation platform which can perform genome segmentation on incomplete regulatory genomics dataset collections without using imputation. Instead of relying on imputed data, we use an expectation-maximization approach to estimate marginal density functions within each regulatory state. We demonstrate that our genome segmentation results compare favorably with approaches based on imputation or other strategies for handling missing data. We further show that our approach can accurately impute missing data after genome segmentation, reversing the typical order of imputation/genome segmentation pipelines. Finally, we present a new 2D genome segmentation analysis of 127 human cell types studied by the Roadmap Epigenomics Consortium. By using an expanded set of chromatin marks that have been profiled in subsets of these cell types, our new segmentation results capture a more complex picture of combinatorial regulatory patterns that appear on the human genome.
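A toy illustration of that idea, not the IDEAS implementation: in an EM-style E-step for a diagonal Gaussian mixture, each genomic position can be scored against each regulatory state using the marginal density over only the marks that were actually assayed, so missing marks are simply left out rather than imputed first. The states, marks, and parameter values below are made up.

```python
# E-step responsibilities with NaNs marking marks that were not assayed.
import numpy as np
from scipy.stats import norm

def responsibilities(x, means, sds, weights):
    """Posterior state probabilities for one position, marginalizing missing marks."""
    observed = ~np.isnan(x)
    log_p = np.log(weights).copy()
    for k in range(len(weights)):
        # Marginal log-density: product over observed dimensions only.
        log_p[k] += norm.logpdf(x[observed], means[k, observed], sds[k, observed]).sum()
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

means = np.array([[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]])    # two toy states, three marks
sds = np.ones_like(means)
weights = np.array([0.5, 0.5])
x = np.array([2.8, np.nan, 3.2])                        # one mark not assayed
print(responsibilities(x, means, sds, weights))
```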
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Chronograms from molecular dating are increasingly being used to infer rates of diversification and their change over time. A major limitation in such analyses is incomplete species sampling that moreover is usually non-random. While the widely used γ statistic with the MCCR test or the birth-death likelihood analysis with the ∆AICrc test statistic are appropriate for comparing the fit of different diversification models in phylogenies with random species sampling, no objective, automated method has been developed for fitting diversification models to non-randomly sampled phylogenies. Here we introduce a novel approach, CorSiM, which involves simulating missing splits under a constant-rate birth-death model and allows the user to specify whether species sampling in the phylogeny being analyzed is random or non-random. The completed trees can be used in subsequent model-fitting analyses. This is fundamentally different from previous diversification rate estimation methods, which were based on null distributions derived from the incomplete trees. CorSiM is automated in an R package and can easily be applied to large data sets. We illustrate the approach in two Araceae clades, one with a random species sampling of 52% and one with a non-random sampling of 55%. In the latter clade, the CorSiM approach detects and quantifies an increase in diversification rate while classic approaches prefer a constant rate model, whereas in the former clade, results do not differ among methods (as indeed expected since the classic approaches are valid only for randomly sampled phylogenies). The CorSiM method greatly reduces the type I error in diversification analysis, but type II error remains a methodological problem.
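A much-simplified toy, not the CorSiM algorithm (which is an R package and conditions the simulated splits on the observed tree and sampling scheme): it only shows how split times can be simulated forward under a constant-rate birth-death model and appended to observed split times to form a "completed" set for downstream model fitting. All rates and times are arbitrary.

```python
# Forward simulation of split times under a constant-rate birth-death process.
import numpy as np

rng = np.random.default_rng(2)

def birth_death_split_times(birth=1.0, death=0.3, t_max=5.0):
    """Times of speciation events in one forward birth-death simulation."""
    t, lineages, splits = 0.0, 1, []
    while lineages > 0:
        rate = lineages * (birth + death)
        t += rng.exponential(1.0 / rate)
        if t > t_max:
            break
        if rng.random() < birth / (birth + death):
            lineages += 1
            splits.append(t)          # a speciation (split) event
        else:
            lineages -= 1             # an extinction event
    return splits

# One way a "completed" set of splits could be assembled: observed split times
# from a chronogram plus simulated times standing in for the unsampled splits.
observed_splits = [0.8, 1.9, 3.1]                       # hypothetical chronogram ages
simulated_missing = birth_death_split_times()[:2]       # pretend two splits were missed
completed = sorted(observed_splits + simulated_missing)
print(completed)
```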