63 datasets found
  1. Data from: Water-quality data imputation with a high percentage of missing values: a machine learning approach

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 8, 2021
    + more versions
    Cite
    Lorena Etcheverry (2021). Water-quality data imputation with a high percentage of missing values: a machine learning approach [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4731168
    Dataset updated
    Jun 8, 2021
    Dataset provided by
    Alberto Castro
    Mónica Fossati
    Angela Gorgoglione
    Marcos Pastorini
    Rafael Rodríguez
    Lorena Etcheverry
    Christian Chreties
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The monitoring of surface-water quality followed by water-quality modeling and analysis is essential for generating effective strategies in water resource management. However, water-quality studies are limited by the lack of complete and reliable data sets on surface-water-quality variables. These deficiencies are particularly noticeable in developing countries.

    This work focuses on surface-water-quality data from the Santa Lucía Chico river (Uruguay), a mixed lotic and lentic river system. Data collected at six monitoring stations are publicly available at https://www.dinama.gub.uy/oan/datos-abiertos/calidad-agua/. The high temporal and spatial variability that characterizes water-quality variables, together with the high rate of missing values (between 50% and 70%), raises significant challenges.

    To deal with missing values, we applied several statistical and machine-learning imputation methods, spanning both univariate and multivariate approaches: inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Huber Regressor (HR), Support Vector Regressor (SVR), and K-nearest neighbors Regressor (KNNR).

    IDW outperformed the others, achieving a very good performance (NSE greater than 0.8) in most cases.
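As an illustration of the winning method, the sketch below implements plain IDW imputation across stations together with the NSE score cited above. The station coordinates, the power parameter p = 2, and the table layout (rows = sampling dates, columns = stations) are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np
import pandas as pd

def idw_impute(values: pd.DataFrame, coords: np.ndarray, p: float = 2.0) -> pd.DataFrame:
    """values: rows = sampling dates, columns = stations, NaN = missing.
    coords: (n_stations, 2) array of station coordinates (an assumption)."""
    out = values.copy()
    # pairwise distances between stations
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    for t, row in values.iterrows():
        observed = row.index[row.notna()]
        if len(observed) == 0:
            continue
        for s in row.index[row.isna()]:
            i = values.columns.get_loc(s)
            j = [values.columns.get_loc(o) for o in observed]
            w = 1.0 / dist[i, j] ** p          # inverse-distance weights
            out.loc[t, s] = np.sum(w * row[observed].to_numpy()) / np.sum(w)
    return out

def nse(obs: np.ndarray, sim: np.ndarray) -> float:
    """Nash-Sutcliffe efficiency; NSE > 0.8 is the 'very good' band cited above."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)
```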

    In this dataset, we include the original and imputed values for the following variables:

    Water temperature (Tw)

    Dissolved oxygen (DO)

    Electrical conductivity (EC)

    pH

    Turbidity (Turb)

    Nitrite (NO2-)

    Nitrate (NO3-)

    Total Nitrogen (TN)

    Each variable is identified as [STATION] VARIABLE FULL NAME (VARIABLE SHORT NAME) [UNIT METRIC].

    More details about the study area, the original datasets, and the methodology adopted can be found in our paper https://www.mdpi.com/2071-1050/13/11/6318.

    If you use this dataset in your work, please cite our paper: Rodríguez, R.; Pastorini, M.; Etcheverry, L.; Chreties, C.; Fossati, M.; Castro, A.; Gorgoglione, A. Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach. Sustainability 2021, 13, 6318. https://doi.org/10.3390/su13116318

  2. Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option

    • figshare.com
    pdf
    Updated Oct 12, 2016
    Cite
    Nikolaj Bak; Lars K. Hansen (2016). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464
    Available download formats: pdf
    Dataset updated
    Oct 12, 2016
    Dataset provided by
    PLOS ONE
    Authors
    Nikolaj Bak; Lars K. Hansen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Missing data is a common problem in many research fields and a challenge that always needs careful consideration. One approach is to impute the missing values, i.e., replace them with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can depend strongly on what is missing. To help decide which records should be imputed, we propose a machine learning approach that estimates the imputation error for each case with missing data. The method is intended as a practical aid for users who have already made the informed choice to impute. All patterns of missing values are simulated in the complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighting the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set, since it will differ with the data, research question, and analysis method. The effect of the threshold can be estimated using the complete cases: the user can either set an a priori relevant threshold for what is acceptable or use cross-validation with the final analysis to choose one. The choice can then be presented along with the argument for it, rather than holding to conventions that may not be warranted in the specific dataset.
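The following is a minimal sketch of the core idea under placeholder assumptions of our own (a column-mean imputer and a Gaussian similarity kernel, neither of which is prescribed by the paper): simulate the case's missing pattern in every complete case, record the resulting "true" errors, and average them weighted by similarity to the case at hand.

```python
import numpy as np

def estimate_case_error(X_complete: np.ndarray, x_incomplete: np.ndarray,
                        bandwidth: float = 1.0) -> float:
    pattern = np.isnan(x_incomplete)           # this case's missing pattern
    # simulate the pattern in every complete case
    X_masked = X_complete.copy()
    X_masked[:, pattern] = np.nan
    # placeholder imputer: column means learned from the complete cases
    col_means = X_complete.mean(axis=0)
    X_imputed = np.where(np.isnan(X_masked), col_means, X_masked)
    # "true error" of the imputation in each simulated case (RMSE)
    true_err = np.sqrt(((X_imputed - X_complete) ** 2)[:, pattern].mean(axis=1))
    # weight each simulated case by similarity on the observed entries
    dist = np.linalg.norm(X_complete[:, ~pattern] - x_incomplete[~pattern], axis=1)
    w = np.exp(-((dist / bandwidth) ** 2))     # placeholder Gaussian kernel
    return float(np.sum(w * true_err) / np.sum(w))
```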

  3. Replication data for: What To Do about Missing Data in Time-Series Cross-Sectional Data

    • dataverse.harvard.edu
    Updated Sep 20, 2024
    Cite
    James Honaker; Gary King (2024). Replication data for: What To Do about Missing Data in Time-Series Cross-Sectional Data [Dataset]. http://doi.org/10.7910/DVN/GGUR0P
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    James Honaker; Gary King
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Applications of modern methods for analyzing data with missing values, based primarily on multiple imputation, have in the last half-decade become common in American politics and political behavior. Scholars in these fields have thus increasingly avoided the biases and inefficiencies caused by ad hoc methods like listwise deletion and best-guess imputation. However, researchers in much of comparative politics and international relations, and others with similar data, have been unable to do the same because the best available imputation methods work poorly with the time-series cross-section data structures common in these fields. We attempt to rectify this situation. First, we build a multiple imputation model that allows smooth time trends, shifts across cross-sectional units, and correlations over time and space, resulting in far more accurate imputations. Second, we build nonignorable missingness models by enabling analysts to incorporate knowledge from area studies experts via priors on individual missing cell values, rather than on difficult-to-interpret model parameters. Third, since these tasks could not be accomplished within existing imputation algorithms, which cannot handle as many variables as needed even in the simpler cross-sectional data for which they were designed, we also develop a new algorithm that substantially expands the range of computationally feasible data types and sizes for which multiple imputation can be used. These developments also made it possible to implement the methods introduced here in freely available open source software, Amelia II: A Program for Missing Data, that is considerably more reliable than existing strategies. See also: Missing Data
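The methods described here are implemented in Amelia II, an R package. Purely as a generic stand-in, the sketch below draws m imputations with scikit-learn's IterativeImputer and pools a column mean via Rubin's rules; it does not reproduce Amelia's time-trend, cross-sectional, or prior machinery.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def pooled_mean(X: np.ndarray, col: int, m: int = 5):
    """Pool the mean of column `col` over m imputations via Rubin's rules."""
    estimates, variances = [], []
    for seed in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=seed)
        Xc = imp.fit_transform(X)
        estimates.append(Xc[:, col].mean())
        variances.append(Xc[:, col].var(ddof=1) / len(Xc))  # var of the mean
    q = np.mean(estimates)                     # pooled point estimate
    u = np.mean(variances)                     # within-imputation variance
    b = np.var(estimates, ddof=1)              # between-imputation variance
    return q, np.sqrt(u + (1 + 1 / m) * b)     # Rubin's total variance
```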

  4. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF

    • frontiersin.figshare.com
    • figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Complete datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods mainly require scripting skills and are implemented across various packages and syntaxes. Thus, implementing a full suite of methods is generally out of reach for all but experienced data scientists. Moreover, imputation is often treated as a separate exercise from exploratory data analysis, when it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is built on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
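ImputEHR itself is a graphical tool, so the code below is not taken from it; it is a hedged sketch of the gradient-boosted tree-based imputation idea the description mentions, using scikit-learn's HistGradientBoostingRegressor, which accepts NaN in its inputs natively.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def gbt_impute(X: np.ndarray) -> np.ndarray:
    """Fill each column's NaNs by regressing it on the remaining columns."""
    X_out = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any() or miss.all():
            continue
        other = np.delete(X, j, axis=1)        # predictors may keep their NaNs
        model = HistGradientBoostingRegressor().fit(other[~miss], X[~miss, j])
        X_out[miss, j] = model.predict(other[miss])
    return X_out
```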

  5. Data from: Missing data estimation in morphometrics: how much is too much?

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    Updated Jun 1, 2022
    Cite
    Julien Clavel; Gildas Merceron; Gilles Escarguel (2022). Data from: Missing data estimation in morphometrics: how much is too much? [Dataset]. http://doi.org/10.5061/dryad.f0b50
    Dataset updated
    Jun 1, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julien Clavel; Gildas Merceron; Gilles Escarguel
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Fossil-based estimates of diversity and evolutionary dynamics mainly rely on the study of morphological variation. Unfortunately, organism remains are often altered by post-mortem taphonomic processes such as weathering or distortion. Such loss of information often prevents quantitative multivariate description and statistically controlled comparisons of extinct species based on morphometric data. A common way to deal with missing data involves imputation methods that directly fill the missing cases with model estimates. Over the last several years, empirically determined thresholds for the maximum acceptable proportion of missing values have been proposed in the literature, whereas other studies have shown that this limit actually depends on several properties of the study dataset and of the selected imputation method, and is in no way generalizable. We evaluate the relative performances of seven multiple imputation techniques through a simulation-based analysis under three distinct patterns of missing data distribution. Overall, Fully Conditional Specification and Expectation-Maximization algorithms provide the best compromises between imputation accuracy and coverage probability. Multiple imputation (MI) techniques appear remarkably robust to the violation of basic assumptions, such as the occurrence of taxonomically or anatomically biased patterns of missing data distribution, making differences in simulation results between the three patterns of missing data distribution much smaller than differences between the individual MI techniques. Based on these results, rather than proposing a new (set of) threshold value(s), we develop an approach combining the use of multiple imputations with Procrustean superimposition of principal component analysis results, in order to directly visualize the effect of individual missing data imputation on an ordinated space. We provide an R function for users to implement the proposed procedure.
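A minimal sketch of the proposed visualization idea: impute several times, ordinate each completed dataset with PCA, Procrustes-align the ordinations, and measure how far each specimen scatters across imputations. The authors provide an R function; this Python stand-in substitutes scikit-learn's IterativeImputer for the MI techniques they compare.

```python
import numpy as np
from scipy.spatial import procrustes
from sklearn.decomposition import PCA
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def mi_pca_spread(X: np.ndarray, m: int = 5, n_components: int = 2) -> np.ndarray:
    """Per-specimen dispersion across m imputations in PCA space."""
    scores = []
    for seed in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=seed)
        scores.append(PCA(n_components).fit_transform(imp.fit_transform(X)))
    ref = scores[0]
    aligned = [procrustes(ref, s)[1] for s in scores]  # superimpose on ref
    stack = np.stack(aligned)                          # (m, n_specimens, k)
    return stack.std(axis=0).mean(axis=1)              # large = unstable specimen
```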

  6. New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Mar 12, 2025
    Cite
    National Institute of Justice (2025). New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 [Dataset]. https://catalog.data.gov/dataset/new-approach-to-evaluating-supplementary-homicide-report-shr-data-imputation-1990-1995-ff769
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice
    Description

    The purpose of the project was to learn more about patterns of homicide in the United States by strengthening the ability to make imputations for Supplementary Homicide Report (SHR) data with missing values. Supplementary Homicide Reports (SHR) and local police data from Chicago, Illinois, St. Louis, Missouri, Philadelphia, Pennsylvania, and Phoenix, Arizona, for 1990 to 1995 were merged to create a master file by linking on overlapping information on victim and incident characteristics. Through this process, 96 percent of the cases in the SHR were matched with cases in the police files. The data contain variables for three types of cases: complete in SHR, missing offender and incident information in SHR but known in police report, and missing offender and incident information in both. The merged file allows estimation of similarities and differences between the cases with known offender characteristics in the SHR and those in the other two categories. The accuracy of existing data imputation methods can be assessed by comparing imputed values in an "incomplete" dataset (the SHR), generated by the three imputation strategies discussed in the literature, with the actual values in a known "complete" dataset (combined SHR and police data). Variables from both the Supplemental Homicide Reports and the additional police report offense data include incident date, victim characteristics, offender characteristics, incident details, geographic information, as well as variables regarding the matching procedure.

  7. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets

    • acs.figshare.com
    xlsx
    Updated Jun 11, 2023
    Cite
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker (2023). A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets [Dataset]. http://doi.org/10.1021/acs.jproteome.1c00070.s003
    Available download formats: xlsx
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    ACS Publications
    Authors
    Kruttika Dabke; Simion Kreimer; Michelle R. Jones; Sarah J. Parker
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution-series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution-series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level (fragment level) improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using the two smaller complementary data sets to narrow down the most accurate methods for the larger proteomic data set. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method depends on the overall structure of the data set, and it provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
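Since these data sets contain real missing values with no ground truth, a common surrogate check is to hide a fraction of the observed entries, impute, and score recovery. The sketch below illustrates that evaluation idea under assumptions of our own (mask fraction, NRMSE score); it is not the authors' exact protocol.

```python
import numpy as np

def masked_nrmse(X: np.ndarray, imputer, frac: float = 0.05, seed: int = 0) -> float:
    """Hide `frac` of the observed entries, impute, and score recovery."""
    rng = np.random.default_rng(seed)
    obs = np.argwhere(~np.isnan(X))
    hide = obs[rng.choice(len(obs), int(frac * len(obs)), replace=False)]
    X_masked = X.copy()
    X_masked[hide[:, 0], hide[:, 1]] = np.nan
    X_hat = imputer(X_masked)                  # any array-in/array-out imputer
    truth = X[hide[:, 0], hide[:, 1]]
    pred = X_hat[hide[:, 0], hide[:, 1]]
    return float(np.sqrt(np.mean((truth - pred) ** 2)) / np.std(truth))
```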

  8. Dataset information.

    • plos.figshare.com
    Updated Nov 30, 2023
    Cite
    Min-Wei Huang; Chih-Fong Tsai; Shu-Ching Tsui; Wei-Chao Lin (2023). Dataset information. [Dataset]. http://doi.org/10.1371/journal.pone.0295032.t001
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Min-Wei Huang; Chih-Fong Tsai; Shu-Ching Tsui; Wei-Chao Lin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data discretization aims to transform a set of continuous features into discrete features, thus simplifying the representation of information and making it easier to understand, use, and explain. In practice, users can take advantage of the discretization process to improve knowledge discovery and data analysis on medical-domain problem datasets containing continuous features. However, certain feature values are frequently missing, and many data-mining algorithms cannot handle incomplete datasets. In this study, we considered the use of both discretization and missing-value imputation to process incomplete medical datasets, examining how the order in which discretization and missing-value imputation are performed influences performance. The experimental results were obtained using seven different medical-domain problem datasets; two discretizers, namely the minimum description length principle (MDLP) and ChiMerge; three imputation methods, namely the mean/mode, classification and regression tree (CART), and k-nearest neighbor (KNN) methods; and two classifiers, namely support vector machines (SVM) and the C4.5 decision tree. The results show that better performance is obtained by first performing discretization followed by imputation, rather than vice versa. Furthermore, the highest classification accuracy rate was achieved by combining ChiMerge and KNN with SVM.
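A sketch of the two processing orders the study compares. Scikit-learn ships neither MDLP nor ChiMerge, so equal-frequency binning stands in for the discretizer here; KNNImputer matches the study's KNN imputation.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import KBinsDiscretizer

def discretize_then_impute(X: np.ndarray, n_bins: int = 5) -> np.ndarray:
    """Bin each feature on its observed values, then impute the bin codes."""
    X_disc = np.full_like(X, np.nan, dtype=float)
    for j in range(X.shape[1]):
        obs = ~np.isnan(X[:, j])
        disc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
        X_disc[obs, j] = disc.fit_transform(X[obs, j].reshape(-1, 1)).ravel()
    return KNNImputer(n_neighbors=5).fit_transform(X_disc)

def impute_then_discretize(X: np.ndarray, n_bins: int = 5) -> np.ndarray:
    X_imp = KNNImputer(n_neighbors=5).fit_transform(X)
    disc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
    return disc.fit_transform(X_imp)
```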

  9. Data from: Missing data handling methods

    • kaggle.com
    Updated Jul 6, 2024
    Cite
    Krisztián Boros (2024). Missing data handling methods [Dataset]. https://www.kaggle.com/datasets/krisztinboros/missing-data-handling-methods/code
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Krisztián Boros
    Description

    Dataset for the paper "Identifying missing data handling methods with text mining".
    It records the type of missing-data handling method used by each paper.

    Column description

    id: ID of the article
    origin: Source journal
    pub_year: Publication year
    discipline: Discipline category of the article based on origin
    about_missing: Is the article about missing data handling? (0 - no, 1 - yes)
    imputation: Was some kind of imputation technique used in the article? (0 - no, 1 - yes)
    advanced: Was some kind of advanced imputation technique used in the article? (0 - no, 1 - yes)
    deletion: Was some kind of deletion technique used in the article? (0 - no, 1 - yes)
    text_tokens: Text snippets extracted from the original articles
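A quick-start sketch for these columns; the CSV file name is an assumption, so check the Kaggle listing for the actual name.

```python
import pandas as pd

df = pd.read_csv("missing_data_handling_methods.csv")   # hypothetical file name
# among articles about missing-data handling, the share per discipline
# that used imputation, advanced imputation, or deletion
rates = (df[df["about_missing"] == 1]
         .groupby("discipline")[["imputation", "advanced", "deletion"]]
         .mean())
print(rates.sort_values("imputation", ascending=False))
```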

  10. Data from: A multiple imputation method using population information

    • tandf.figshare.com
    pdf
    Updated Apr 30, 2025
    Cite
    Tadayoshi Fushiki (2025). A multiple imputation method using population information [Dataset]. http://doi.org/10.6084/m9.figshare.28900017.v1
    Available download formats: pdf
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Tadayoshi Fushiki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multiple imputation (MI) is effectively used to deal with missing data when the missing mechanism is missing at random. However, MI may not be effective when the missing mechanism is not missing at random (NMAR). In such cases, additional information is required to obtain an appropriate imputation. Pham et al. (2019) proposed the calibrated-δ adjustment method, which is a multiple imputation method using population information. It provides appropriate imputation in two NMAR settings. However, the calibrated-δ adjustment method has two problems. First, it can be used only when one variable has missing values. Second, the theoretical properties of the variance estimator have not been provided. This article proposes a multiple imputation method using population information that can be applied when several variables have missing values. The proposed method is proven to include the calibrated-δ adjustment method. It is shown that the proposed method provides a consistent estimator for the parameter of the imputation model in an NMAR situation. The asymptotic variance of the estimator obtained by the proposed method and its estimator are also given.

  11. Supplementary Methods: Multiple imputation of CVRF trajectories in detail

    • search.dataone.org
    • datadryad.org
    Updated May 17, 2025
    Cite
    Kristine Yaffe (2025). Supplementary Methods: Multiple imputation of CVRF trajectories in detail [Dataset]. http://doi.org/10.7272/Q60000BJ
    Dataset updated
    May 17, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Kristine Yaffe
    Time period covered
    Jan 1, 2021
    Description

    Background: Cardiovascular risk factors (CVRFs) are associated with increased risk of cognitive decline, but little is known about how early adult CVRFs and those across the life course might influence late-life cognition. To test the hypothesis that CVRFs across the adult life course are associated with late-life cognitive changes, we pooled data from four prospective cohorts (n = 15,001, ages 18-95).

    Methods: We imputed trajectories of body mass index (BMI), fasting glucose (FG), systolic blood pressure (SBP), and total cholesterol (TC) for older adults. We used linear mixed models to determine the association of early adult, mid-life, and late-life CVRFs with late-life decline on global cognition (Modified Mini-Mental State Exam (3MS)) and processing speed (Digit Symbol Substitution Test (DSST)), adjusting for demographics, education, and cohort.

    Results: Elevated BMI, FG, and SBP (but not TC) at each time period were associated with greater late-life decline. Early life CVRFs were...

  12. Replication Data for: Qualitative Imputation of Missing Potential Outcomes

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 9, 2023
    Cite
    Coppock, Alexander; Kaur, Dipin (2023). Replication Data for: Qualitative Imputation of Missing Potential Outcomes [Dataset]. http://doi.org/10.7910/DVN/2IVKXD
    Dataset updated
    Nov 9, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Coppock, Alexander; Kaur, Dipin
    Description

    We propose a framework for meta-analysis of qualitative causal inferences. We integrate qualitative counterfactual inquiry with an approach from the quantitative causal inference literature called extreme value bounds. Qualitative counterfactual analysis uses the observed outcome and auxiliary information to infer what would have happened had the treatment been set to a different level. Imputing missing potential outcomes is hard and when it fails, we can fill them in under best- and worst-case scenarios. We apply our approach to 63 cases that could have experienced transitional truth commissions upon democratization, 8 of which did. Prior to any analysis, the extreme value bounds around the average treatment effect on authoritarian resumption are 100 percentage points wide; imputation shrinks the width of these bounds to 51 points. We further demonstrate our method by aggregating specialists' beliefs about causal effects gathered through an expert survey, shrinking the width of the bounds to 44 points.
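A worked sketch of the extreme value bounds logic for a binary outcome: each case reveals one potential outcome, and the missing one is filled in with the best and worst possible values to bound the average treatment effect. The counts mirror the application described above (63 cases, 8 treated); the outcome vector is a placeholder, and the no-assumption width of 100 percentage points holds regardless of it.

```python
import numpy as np

def ate_bounds(y: np.ndarray, t: np.ndarray) -> tuple[float, float]:
    """Extreme value bounds on the ATE for a binary outcome y in {0, 1}."""
    y, t = np.asarray(y, float), np.asarray(t, bool)
    n = len(y)
    # best case: every missing Y(1) is 1 and every missing Y(0) is 0
    upper = (y[t].sum() + (~t).sum()) / n - y[~t].sum() / n
    # worst case: every missing Y(1) is 0 and every missing Y(0) is 1
    lower = y[t].sum() / n - (y[~t].sum() + t.sum()) / n
    return lower, upper

t = np.r_[np.ones(8), np.zeros(55)]            # 8 treated of 63 cases
y = np.zeros(63)                               # placeholder outcomes
lo, hi = ate_bounds(y, t)
print(f"width = {100 * (hi - lo):.0f} percentage points")  # 100 before imputation
```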

  13. Replication data for: Modeling global health indicators: missing data imputation and accounting for ‘double uncertainty’

    • dataverse.harvard.edu
    • data.niaid.nih.gov
    Updated May 5, 2014
    Cite
    Jamie mie & Gerstein Bethany (2014). Replication data for: Modeling global health indicators: missing data imputation and accounting for ‘double uncertainty’ [Dataset]. http://doi.org/10.7910/DVN/25683
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    May 5, 2014
    Dataset provided by
    Harvard Dataverse
    Authors
    Jamie mie & Gerstein Bethany
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    2005
    Area covered
    World
    Description

    Global health indicators such as infant and maternal mortality are important for informing priorities for health research, policy development, and resource allocation. However, due to inconsistent reporting within and across nations, construction of comparable indicators often requires extensive data imputation and complex modeling from limited observed data. We draw on Ahmed et al.'s 2012 paper, an analysis of maternal deaths averted by contraceptive use for 172 countries in 2008, as an exemplary case of the challenge of building reliable models with scarce observations. The authors employ a counterfactual modeling approach using regression imputation on the independent variable, which assumes no estimation uncertainty in the final model and does not address the potential for scattered missingness in the predictor variables. We replicate their results and test the sensitivity of their published estimates to the use of an alternative method for imputing missing data, multiple imputation. We also calculate alternative estimates of standard errors for the model estimates that more appropriately account for the uncertainty introduced through data imputation of multiple predictor variables. Based on our results, we discuss the risks associated with the missing-data practices employed and evaluate the appropriateness of multiple imputation as an alternative for data imputation and uncertainty estimation for models of global health indicators.

  14. Characteristics comparison of participants for five multiple imputation datasets

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Katya L. Masconi; Tandi Edith Matsha-Erasmus; Rajiv T. Erasmus; Andre P. Kengne (2023). Characteristics comparison of participants for five multiple imputation datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0139210.t004
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Katya L. Masconi; Tandi Edith Matsha-Erasmus; Rajiv T. Erasmus; Andre P. Kengne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Characteristics comparison of participants for five multiple imputation datasets.

  15. State Profiles: FY 2014 Public Libraries Survey (Data)

    • data.wu.ac.at
    csv, json, rdf, xml
    Updated Dec 7, 2016
    + more versions
    Cite
    Institute of Museum and Library Services (2016). State Profiles: FY 2014 Public Libraries Survey (Data) [Dataset]. https://data.wu.ac.at/schema/data_gov/N2RhMjA3OTctZDFiMS00NjA4LWFjMzYtMjA0ZTA5Zjg4NDIw
    Available download formats: csv, json, rdf, xml
    Dataset updated
    Dec 7, 2016
    Dataset provided by
    Institute of Museum and Library Services (https://www.imls.gov/)
    Description

    Pull up a state's profile to find state-level totals on key data such as numbers of libraries and librarians, revenue and expenditures, and collection sizes.

    These data include imputed values for libraries that did not submit information in the FY 2014 data collection. Imputation is a procedure for estimating a value for a specific data item where the response is missing.

    Download PLS data files to see the imputation flag variables, or learn more about the imputation methods used in FY 2014 at https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey/explore-pls-data/pls-data
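A hedged sketch for locating and inspecting the imputation flags before filtering; the flag-column naming convention (a leading "F_") and the file name are assumptions to verify against the PLS documentation linked above.

```python
import pandas as pd

pls = pd.read_csv("pls_fy2014_state.csv")      # hypothetical file name
# hypothetical convention: imputation flag columns share an "F_" prefix
flag_cols = [c for c in pls.columns if c.upper().startswith("F_")]
# inspect the flag codes before filtering; their meanings are documented
# in the PLS data dictionary, not assumed here
print(pls[flag_cols].apply(pd.Series.value_counts))
```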

  16. State Libraries Survey, FY 1994, Part 3: Revenue & Expenditures

    • data.amerigeoss.org
    csv, json, rdf, xml
    Updated Jul 28, 2019
    + more versions
    Cite
    United States[old] (2019). State Libraries Survey, FY 1994, Part 3: Revenue & Expenditures [Dataset]. https://data.amerigeoss.org/en/dataset/state-libraries-survey-fy-1994-part-3-revenue-expenditures
    Available download formats: csv, json, rdf, xml
    Dataset updated
    Jul 28, 2019
    Dataset provided by
    United States[old]
    Description

    Find key information on state library agencies.

    These data include imputed values for state libraries that did not submit information in this data collection.

    Imputation is a procedure for estimating a value for a specific data item where the response is missing.

    Download SLAA data files to see the imputation flag variables, or learn more about the imputation methods at https://www.imls.gov/research-evaluation/data-collection/state-library-administrative-agency-survey

  17. Quarterly Labour Force Survey Household Dataset, October - December, 2022

    • beta.ukdataservice.ac.uk
    • datacatalogue.cessda.eu
    Updated 2023
    + more versions
    Cite
    Office For National Statistics (2023). Quarterly Labour Force Survey Household Dataset, October - December, 2022 [Dataset]. http://doi.org/10.5255/ukda-sn-9064-2
    Dataset updated
    2023
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    DataCite (https://www.datacite.org/)
    Authors
    Office For National Statistics
    Description
    Background
    The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

    Household datasets
    Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. From January 2011, a pseudonymised household identifier variable (HSERIALP) is also included in the main quarterly LFS dataset instead.

    Change to coding of missing values for household series
    From 1996-2013, all missing values in the household datasets were set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. This was also in line with the Annual Population Survey household series of the time. The change was applied to the back series during 2010 to ensure continuity for analytical purposes. From 2013 onwards, the -8 and -9 categories have been reinstated.
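A short sketch for harmonizing these codes when analyzing household files across years; the file name and variable names below are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("lfs_household.csv")          # hypothetical file name
cat_cols = ["RELIGION", "ETHUKEUL"]            # hypothetical variable names
# treat -10 (1996-2013 combined code) and -8/-9 (2013 onwards) as missing
df[cat_cols] = df[cat_cols].replace([-8, -9, -10], np.nan)
```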

    LFS Documentation
    The documentation available from the Archive to accompany LFS datasets largely consists of the latest version of each volume alongside the appropriate questionnaire for the year concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS LFS User Guidance page before commencing analysis.

    Additional data derived from the QLFS
    The Archive also holds further QLFS series: End User Licence (EUL) quarterly datasets; Secure Access datasets (see below); two-quarter and five-quarter longitudinal datasets; quarterly, annual and ad hoc module datasets compiled for Eurostat; and some additional annual Northern Ireland datasets.

    End User Licence and Secure Access QLFS Household datasets
    Users should note that there are two discrete versions of the QLFS household datasets. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. Secure Access household datasets for the QLFS are available from 2009 onwards, and include additional, detailed variables not included in the standard EUL versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits. For full details of variables included, see data dictionary documentation. The Secure Access version (see SN 7674) has more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements.

    Changes to variables in QLFS Household EUL datasets
    In order to further protect respondent confidentiality, ONS have made some changes to variables available in the EUL datasets. From July-September 2015 onwards, 4-digit industry class is available for main job only, meaning that 3-digit industry group is the most detailed level available for second and last job.

    Review of imputation methods for LFS Household data - changes to missing values
    A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method is primarily focused on ensuring that the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger amounts of missing values ('-8'/'-9') will be present in the data for these personal characteristic variables than before. Therefore, if users need to carry out any time-series analysis of households/families which also includes personal characteristic variables covering this time period, it is advised to filter off 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.

    Occupation data for 2021 and 2022 data files

    The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

    Latest edition information

    For the second edition (September 2023), the variables NSECM20, NSECMJ20, SC2010M, SC20SMJ, SC20SMN and SOC20M have been replaced with new versions. Further information on the SOC revisions can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

  18. Library Systems: FY 2012 Public Libraries Survey (Administrative Entity)

    • data.amerigeoss.org
    • data.wu.ac.at
    csv, json, rdf, xml
    Updated Aug 30, 2016
    + more versions
    Cite
    United States (2016). Library Systems: FY 2012 Public Libraries Survey (Administrative Entity) [Dataset]. https://data.amerigeoss.org/mk/dataset/library-systems-fy-2012-public-libraries-survey-administrative-entity
    Available download formats: csv, json, rdf, xml
    Dataset updated
    Aug 30, 2016
    Dataset provided by
    United States
    License

    U.S. Government Works: https://www.usa.gov/government-works

    Description

    Find key information on library systems around the United States.

    These data include imputed values for libraries that did not submit information in the FY 2012 data collection. Imputation is a procedure for estimating a value for a specific data item where the response is missing.

    Download PLS data files to see the imputation flag variables, or learn more about the imputation methods used in FY 2012 at https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey/explore-pls-data/pls-data

  19. Data from: A simple approach for maximizing the overlap of phylogenetic and comparative data

    • figshare.mq.edu.au
    • borealisdata.ca
    • +3 more
    bin
    Updated May 30, 2023
    Cite
    Matthew W. Pennell; Richard G. FitzJohn; William K. Cornwell (2023). Data from: A simple approach for maximizing the overlap of phylogenetic and comparative data [Dataset]. http://doi.org/10.5061/dryad.5d3rq
    Available download formats: bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    Macquarie University
    Authors
    Matthew W. Pennell; Richard G. FitzJohn; William K. Cornwell
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Biologists are increasingly using curated, public data sets to conduct phylogenetic comparative analyses. Unfortunately, there is often a mismatch between species for which there is phylogenetic data and those for which other data are available. As a result, researchers are commonly forced to either drop species from analyses entirely or else impute the missing data. A simple strategy to improve the overlap of phylogenetic and comparative data is to swap species in the tree that lack data with 'phylogenetically equivalent' species that have data. While this procedure is logically straightforward, it quickly becomes very challenging to do by hand. Here, we present algorithms that use topological and taxonomic information to maximize the number of swaps without altering the structure of the phylogeny. We have implemented our method in a new R package, phyndr, which will allow researchers to apply our algorithm to empirical data sets. It is efficient enough that taxon swaps can be computed quickly, even for large trees. To facilitate the use of taxonomic knowledge, we created a separate data package, taxonlookup; it contains a curated, versioned taxonomic lookup for land plants and is interoperable with phyndr. Emerging online databases and statistical advances are making it possible for researchers to investigate evolutionary questions at unprecedented scales. However, in this effort species mismatch among data sources will increasingly be a problem; evolutionary informatics tools, such as phyndr and taxonlookup, can help alleviate this issue.

    Usage Notes: Land plant taxonomic lookup table. This dataset is a stable version (version 1.0.1) of the dataset contained in the taxonlookup R package (see https://github.com/traitecoevo/taxonlookup for the most recent version). It contains a taxonomic reference table for 16,913 genera of land plants, along with the number of recognized species in each genus. File: plant_lookup.csv

  20. Table_1_Evaluating the Accuracy of Imputation Methods in a Five-Way Admixed Population.DOCX

    • figshare.com
    Updated Jun 4, 2023
    + more versions
    Cite
    Haiko Schurz; Stephanie J. Müller; Paul David van Helden; Gerard Tromp; Eileen G. Hoal; Craig J. Kinnear; Marlo Möller (2023). Table_1_Evaluating the Accuracy of Imputation Methods in a Five-Way Admixed Population.DOCX [Dataset]. http://doi.org/10.3389/fgene.2019.00034.s001
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Haiko Schurz; Stephanie J. Müller; Paul David van Helden; Gerard Tromp; Eileen G. Hoal; Craig J. Kinnear; Marlo Möller
    Description

    Genotype imputation is a powerful tool for increasing statistical power in an association analysis. Meta-analysis of multiple study datasets also requires a substantial overlap of SNPs for a successful association analysis, which can be achieved by imputation. The quality of imputed datasets is largely dependent on the software used, as well as the reference populations chosen. The accuracy of imputation of available reference populations has not been tested for the five-way admixed South African Coloured (SAC) population. In this study, imputation results obtained using three freely accessible methods were evaluated for accuracy and quality. We show that the African Genome Resource is the best reference panel for imputation of missing genotypes in samples from the SAC population, implemented via the freely accessible Sanger Imputation Server.
