48 datasets found
1. Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option

    • plos.figshare.com
    • figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Nikolaj Bak; Lars K. Hansen (2023). Data Driven Estimation of Imputation Error—A Strategy for Imputation with a Reject Option [Dataset]. http://doi.org/10.1371/journal.pone.0164464
    Explore at:
5 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Nikolaj Bak; Lars K. Hansen
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Missing data is a common problem in many research fields and a challenge that always needs careful consideration. One approach is to impute the missing values, i.e., replace missing values with estimates. When imputation is applied, it is typically applied to all records with missing values indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help decide which records should be imputed, we propose a machine learning approach that estimates the imputation error for each case with missing data. The method is intended as a practical aid for users of imputation after the informed choice to impute the missing data has been made. To do this, all patterns of missing values are simulated in all complete cases, enabling calculation of the “true error” in each of these new cases. The error is then estimated for each case with missing values by weighting the “true errors” by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set, since this will differ according to the data, research question, and analysis method. The effect of the threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable, or use cross-validation with the final analysis to choose the threshold. The choice can then be presented along with the argumentation for it, rather than holding to conventions that might not be warranted for the specific dataset.
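A minimal sketch of this strategy in Python (simplified to replay only the query case's missingness pattern; the function names, the exponential similarity kernel, and the toy data are our choices, not the authors'):

import numpy as np

def estimate_imputation_error(x_incomplete, complete_cases, impute_fn, bandwidth=1.0):
    """Replay the query case's missingness pattern on every complete case,
    measure the true imputation error there, then average those errors
    weighted by each complete case's similarity to the query case."""
    miss = np.isnan(x_incomplete)
    errors, weights = [], []
    for x in complete_cases:
        x_masked = x.copy()
        x_masked[miss] = np.nan                    # simulate the same pattern
        x_hat = impute_fn(x_masked)                # any imputation method
        errors.append(np.sqrt(np.mean((x_hat[miss] - x[miss]) ** 2)))
        dist = np.linalg.norm(x[~miss] - x_incomplete[~miss])
        weights.append(np.exp(-dist / bandwidth))  # one possible similarity kernel
    return float(np.average(errors, weights=weights))

# Impute only when the estimated error clears an a priori or cross-validated threshold.
rng = np.random.default_rng(0)
complete = rng.normal(size=(50, 4))
case = complete[0].copy(); case[2] = np.nan
mean_impute = lambda v: np.where(np.isnan(v), np.nanmean(complete, axis=0), v)
print(estimate_imputation_error(case, complete[1:], mean_impute))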

2. Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets

    • datahub.hku.hk
    Updated Aug 13, 2020
    Cite
    Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
    Explore at:
    Dataset updated
    Aug 13, 2020
    Dataset provided by
    HKU Data Repository
    Authors
    Wen Ma
    License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description
1. NCCTG Lung Cancer dataset: survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.

2. CNV measurements of GBM: this dataset records information about copy number variation (CNV) of Glioblastoma (GBM).

Abstract: In biology and medicine, conservative patient and data collection malpractice can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of predicting cancer survival. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called “imputation”. Existing imputation methods work by establishing a model based on the data mechanism of the missing values, and they work well under two assumptions: 1) the data is missing completely at random, and 2) the percentage of missing values is not high. Neither assumption typically holds in biomedical datasets, such as the Cancer Genome Atlas Glioblastoma Copy-Number Variant dataset (TCGA: 108 columns) or the North Central Cancer Treatment Group Lung Cancer dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: Last Observation Carried Forward (LOCF) and the K-Nearest Neighbour algorithm (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset, which has fewer columns, and failed when the dataset contained 45% missing data. The quality of the values imputed by existing methods is poor because these datasets do not meet the two assumptions.

In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and a high percentage of missing values. An RBM is an undirected, probabilistic, parameterized two-layer neural network model, often used for extracting abstract information from data, especially high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to two cancer datasets: 1) NCCTG and 2) TCGA. The running time and root mean squared error (RMSE) of the different methods were gauged. The benchmarks for the NCCTG dataset show that our method performs better than other methods when there is 5% missing data in the dataset, with an RMSE 4.64 lower than the best KNN result. For the TCGA dataset, our method achieved an RMSE 0.78 lower than the best KNN result.

In addition to imputation, RBMs can make predictions simultaneously. We compared the RBM model with four traditional prediction methods, measuring running time and area under the curve (AUC) to evaluate performance. Our RBM-based approach outperformed the traditional methods: the AUC was up to 19.8% higher than the multivariate logistic regression model on the NCCTG lung cancer dataset, and up to 28.1% higher than the Cox proportional hazards regression model on the TCGA dataset.

Apart from imputation and prediction, RBM models can detect outliers in one pass, reconstructing all the inputs in the visible layer in a single backward pass. Our results show that RBM models achieved higher precision and recall in detecting outliers than other methods.
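A crude sketch of RBM-based imputation using scikit-learn's BernoulliRBM (our stand-in; the thesis' own model and hyperparameters are not reproduced here). It fits on the complete rows of a [0,1]-scaled matrix, then Gibbs-samples the missing cells while clamping the observed ones:

import numpy as np
from sklearn.neural_network import BernoulliRBM

def rbm_impute(X, n_gibbs=200, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.isnan(X)
    rbm = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=40,
                       random_state=seed)
    rbm.fit(X[~mask.any(axis=1)])             # train on complete rows only

    X_hat = X.copy()
    X_hat[mask] = rng.random(mask.sum())      # random start for missing cells
    tail = []
    for step in range(n_gibbs):
        sample = rbm.gibbs(X_hat)             # one Gibbs step over all cells
        X_hat[mask] = sample[mask]            # keep samples only where missing
        if step >= n_gibbs - 20:
            tail.append(X_hat[mask].copy())
    X_hat[mask] = np.mean(tail, axis=0)       # average out sampling noise
    return X_hat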
3. DataSheet_2_A Deep Learning Approach for Missing Data Imputation of Rating Scales Assessing Attention-Deficit Hyperactivity Disorder.xlsx

    • frontiersin.figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Chung-Yuan Cheng; Wan-Ling Tseng; Ching-Fen Chang; Chuan-Hsiung Chang; Susan Shur-Fen Gau (2023). DataSheet_2_A Deep Learning Approach for Missing Data Imputation of Rating Scales Assessing Attention-Deficit Hyperactivity Disorder.xlsx [Dataset]. http://doi.org/10.3389/fpsyt.2020.00673.s002
    Explore at:
Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Chung-Yuan Cheng; Wan-Ling Tseng; Ching-Fen Chang; Chuan-Hsiung Chang; Susan Shur-Fen Gau
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

A variety of tools and methods have been used to measure behavioral symptoms of attention-deficit/hyperactivity disorder (ADHD). Missing data is a major concern in ADHD behavioral studies. This study used a deep learning method to impute missing data in ADHD rating scales and evaluated the ability of the imputed dataset (i.e., the imputed data replacing the original missing values) to distinguish youths with ADHD from youths without ADHD. The data were collected from 1220 youths recruited in Northern Taiwan: 799 with an ADHD diagnosis and 421 typically developing (TD) youths without ADHD. Participants were assessed using the Conners’ Continuous Performance Test, the Chinese versions of the Conners’ rating scale-revised: short form for parent and teacher reports, and the Swanson, Nolan, and Pelham, version IV scale for parent and teacher reports. We used deep learning, with information from the original complete dataset (referred to as the reference dataset), to perform missing data imputation and generate an imputation order according to the imputed accuracy of each question. We evaluated the effectiveness of imputation by using a support vector machine to classify the ADHD and TD groups in the imputed dataset. The imputed dataset classified ADHD vs. TD with up to 89% accuracy, which did not differ from the classification accuracy (89%) obtained using the reference dataset. Most of the behaviors related to oppositional behaviors rated by teachers and hyperactivity/impulsivity rated by both parents and teachers showed high discriminatory accuracy in distinguishing ADHD from non-ADHD. Our findings support a deep learning solution for missing data imputation without introducing bias to the data.
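The evaluation step translates to a few lines of scikit-learn; the arrays below are toy stand-ins for the real rating-scale features, which are not reproduced here:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)                                  # 1 = ADHD, 0 = TD (toy labels)
X_reference = rng.normal(size=(300, 20)) + 0.8 * y[:, None]  # stand-in complete ratings
X_imputed = X_reference + rng.normal(scale=0.1, size=X_reference.shape)

def svm_accuracy(X, y):
    return cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

print(f"reference {svm_accuracy(X_reference, y):.2f} vs imputed {svm_accuracy(X_imputed, y):.2f}")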

4. Prediction of radionuclide diffusion enabled by missing data imputation and ensemble machine learning

    • scidb.cn
    Updated May 6, 2025
    Cite
    Jun-Lei Tian; Jia-Xing Feng; Jia-Cong Shen; Lei Yao; Jing-Yan Wang; Tao Wu; Yao-Lin Zhao (2025). Prediction of radionuclide diffusion enabled by missing data imputation and ensemble machine learning [Dataset]. http://doi.org/10.57760/sciencedb.j00186.00710
    Explore at:
Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 6, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Jun-Lei Tian; Jia-Xing Feng; Jia-Cong Shen; Lei Yao; Jing-Yan Wang; Tao Wu; Yao-Lin Zhao
    Description

Missing values in radionuclide diffusion datasets can undermine the predictive accuracy and robustness of machine learning models. A regression-based missing-data imputation method using the light gradient boosting machine (LightGBM) algorithm was employed to impute over 60% of the missing data.
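A sketch of what regression-based imputation with LightGBM can look like; the column names and the toy data are invented for illustration, not taken from the dataset:

import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor  # the algorithm named in the description

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)),
                  columns=["pH", "ionic_strength", "temperature", "diffusion"])
df.loc[rng.random(500) < 0.6, "diffusion"] = np.nan   # >60% missing, as stated

observed = df["diffusion"].notna()
predictors = ["pH", "ionic_strength", "temperature"]
model = LGBMRegressor(n_estimators=300)
model.fit(df.loc[observed, predictors], df.loc[observed, "diffusion"])
df.loc[~observed, "diffusion"] = model.predict(df.loc[~observed, predictors])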

  5. Quarterly Labour Force Survey Household Dataset, October - December, 2021

    • beta.ukdataservice.ac.uk
    • datacatalogue.cessda.eu
    Updated 2023
    + more versions
    Cite
    Office For National Statistics (2023). Quarterly Labour Force Survey Household Dataset, October - December, 2021 [Dataset]. http://doi.org/10.5255/ukda-sn-8925-3
    Explore at:
    Dataset updated
    2023
    Dataset provided by
UK Data Service (https://ukdataservice.ac.uk/)
DataCite (https://www.datacite.org/)
    Authors
    Office For National Statistics
    Description
    Background
    The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

    Household datasets
    Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. From January 2011, a pseudonymised household identifier variable (HSERIALP) is also included in the main quarterly LFS dataset instead.

    Change to coding of missing values for household series
    From 1996-2013, all missing values in the household datasets were set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. This was also in line with the Annual Population Survey household series of the time. The change was applied to the back series during 2010 to ensure continuity for analytical purposes. From 2013 onwards, the -8 and -9 categories have been reinstated.

    LFS Documentation
    The documentation available from the Archive to accompany LFS datasets largely consists of the latest version of each volume alongside the appropriate questionnaire for the year concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS
    LFS User Guidance page before commencing analysis.

    Additional data derived from the QLFS
    The Archive also holds further QLFS series: End User Licence (EUL) quarterly datasets; Secure Access datasets (see below); two-quarter and five-quarter longitudinal datasets; quarterly, annual and ad hoc module datasets compiled for Eurostat; and some additional annual Northern Ireland datasets.

    End User Licence and Secure Access QLFS Household datasets
    Users should note that there are two discrete versions of the QLFS household datasets. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. Secure Access household datasets for the QLFS are available from 2009 onwards, and include additional, detailed variables not included in the standard EUL versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits. For full details of variables included, see data dictionary documentation. The Secure Access version (see SN 7674) has more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements.

    Changes to variables in QLFS Household EUL datasets
    In order to further protect respondent confidentiality, ONS have made some changes to variables available in the EUL datasets. From July-September 2015 onwards, 4-digit industry class is available for main job only, meaning that 3-digit industry group is the most detailed level available for second and last job.

    Review of imputation methods for LFS Household data - changes to missing values
A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method primarily focuses on ensuring that the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger amounts of missing values ('-8'/'-9') will be present in the data for these personal characteristic variables than before. Therefore, if users need to carry out any time series analysis of households/families which also includes personal characteristic variables covering this time period, it is advised to filter off 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.
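In pandas, that advice might look like this (the file name is hypothetical; 'ioutcome' and the -8/-9/-10 codes are taken from the catalogue text above):

import pandas as pd

lfs = pd.read_csv("lfs_household.csv")                  # hypothetical file name
lfs = lfs[lfs["ioutcome"] != 3]                         # filter off non-responders
lfs = lfs.replace({-8: pd.NA, -9: pd.NA, -10: pd.NA})   # harmonise the missing codes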

    Occupation data for 2021 and 2022 data files

The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

    Latest edition information

For the third edition (September 2023), the variables NSECM20, NSECMJ20, SC2010M, SC20SMJ, SC20SMN and SOC20M have been replaced with new versions. Further information on the SOC revisions can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

6. Additional file 4 of Heckman imputation models for binary or continuous MNAR outcomes and MAR predictors

    • springernature.figshare.com
    txt
    Updated Jun 1, 2023
    + more versions
    Cite
    Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon (2023). Additional file 4 of Heckman imputation models for binary or continuous MNAR outcomes and MAR predictors [Dataset]. http://doi.org/10.6084/m9.figshare.7038104.v1
    Explore at:
Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Jacques-Emmanuel Galimard; Sylvie Chevret; Emmanuel Curis; Matthieu Resche-Rigon
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R code to impute binary outcome. (R 1 kb)

7. Average (of S = 200 elapsed time values) processing time (in seconds) required by different algorithms to impute one dataset with 10% missing values

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Faisal Maqbool Zahid; Shahla Faisal; Christian Heumann (2023). Average (of S = 200 elapsed time values) processing time (in seconds) required by different algorithms to impute one dataset with 10% missing values. [Dataset]. http://doi.org/10.1371/journal.pone.0254112.t003
    Explore at:
Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Faisal Maqbool Zahid; Shahla Faisal; Christian Heumann
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Average (of S = 200 elapsed time values) processing time (in seconds) required by different algorithms to impute one dataset with 10% missing values.

  8. NielsenHackathon

    • kaggle.com
    Updated Jan 1, 2021
    Cite
    Aadarsh Singh (2021). NielsenHackathon [Dataset]. https://www.kaggle.com/datasets/paradoxlover/nielsenhackathon
    Explore at:
Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 1, 2021
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Aadarsh Singh
    Description

    Context

Create a model that can help impute/extrapolate data to fill in the missing data gaps in the store-level POS data currently received.

    Task:

Build an imputation and/or extrapolation model to fill the missing data gaps for select stores by analyzing the data and determining which factors/variables/features best predict store sales.
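A toy sketch of that workflow (every column name and all the synthetic data are invented for illustration; the real POS schema lives on Kaggle):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
pos = pd.DataFrame({"store_size": rng.normal(1000, 200, 300),
                    "footfall": rng.normal(500, 100, 300),
                    "promo_weeks": rng.integers(0, 52, 300)})
pos["sales"] = 2 * pos["footfall"] + 0.5 * pos["store_size"] + rng.normal(0, 50, 300)
pos.loc[rng.random(300) < 0.2, "sales"] = np.nan        # the missing store data

known = pos["sales"].notna()
features = ["store_size", "footfall", "promo_weeks"]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(pos.loc[known, features], pos.loc[known, "sales"])
print(dict(zip(features, model.feature_importances_.round(2))))  # which factors help most
pos.loc[~known, "sales"] = model.predict(pos.loc[~known, features])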

  9. GECCO Industrial Challenge 2015 Dataset: A heating system dataset for the...

    • zenodo.org
    • data.niaid.nih.gov
    csv, pdf, zip
    Updated Jul 19, 2024
    Cite
Steffen Moritz; Martina Friese; Andreas Fischbach; Christopher Schlitt; Thomas Bartz-Beielstein (2024). GECCO Industrial Challenge 2015 Dataset: A heating system dataset for the 'Recovering missing information in heating system operating data' competition at the Genetic and Evolutionary Computation Conference 2015, Madrid, Spain [Dataset]. http://doi.org/10.5281/zenodo.3884899
    Explore at:
Available download formats: pdf, csv, zip
    Dataset updated
    Jul 19, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Steffen Moritz; Martina Friese; Andreas Fischbach; Christopher Schlitt; Thomas Bartz-Beielstein
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Dataset of the 'Industrial Challenge: Recovering missing information in heating system operating data' competition hosted at the Genetic and Evolutionary Computation Conference (GECCO), July 11th-15th, 2015, Madrid, Spain.

The task of the competition was to recover (impute) missing information in heating system operation time series.

Included in Zenodo:

    - dataset of heating system operational time series with missing values

    - additional material and descriptions provided for the competition

    The competition was organized by:

    M. Friese, A. Fischbach, C. Schlitt, T. Bartz-Beielstein (TH Köln)

    The dataset was provided by:

    Major German heating systems supplier (S. Moritz)

    Industrial Challenge: Recovering missing information in heating system operating data

The Industrial Challenge will be held in the competition session at the Genetic and Evolutionary Computation Conference. It poses difficult real-world problems provided by industry partners from various fields. Highlights of the Industrial Challenge include interesting problem domains, real-world data and realistic quality measurement.

    Overview

    In times of accelerating climate change and rising energy costs, increasing energy efficiency and reducing expenses becomes a high priority goal for businesses and private households alike. Modern heating systems record detailed operating data and report this data to a central system. Here, the operating data can be correlated and analyzed to detect potential optimization opportunities or anomalies like unusually high energy consumption. Due to various difficulties this data might be incomplete which makes accurate forecasting even harder.

The goal of the GECCO 2015 Industrial Challenge is to develop capable procedures to recover missing information in heating system operating data. Adequate recovery of the missing data enables more accurate forecasts, which allow for intelligent control of the heating systems and therefore contribute to a positive energy balance and reduced expenses.
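A minimal baseline for this kind of task, on a toy sensor series (the real challenge data has many channels and more structured outages):

import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=96, freq="15min")
supply_temp = pd.Series(60 + 10 * np.sin(np.linspace(0, 6, 96)), index=idx)
supply_temp.iloc[20:30] = np.nan                     # a block of lost readings
recovered = supply_temp.interpolate(method="time")   # one simple recovery baseline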

    Submission deadline:
    June 22, 2015

    Official Webpage:
    www.spotseven.de/gecco-challenge/gecco-challenge-2015/

10. Data from: A real data-driven simulation strategy to select an imputation method for mixed-type trait data

    • datadryad.org
    zip
    Updated Feb 15, 2023
    Cite
    Jacqueline A. May; Zeny Feng; Sarah J. Adamowicz (2023). A real data-driven simulation strategy to select an imputation method for mixed-type trait data [Dataset]. http://doi.org/10.5061/dryad.crjdfn37m
    Explore at:
Available download formats: zip
    Dataset updated
    Feb 15, 2023
    Dataset provided by
    Dryad
    Authors
    Jacqueline A. May; Zeny Feng; Sarah J. Adamowicz
    Time period covered
    2022
    Description

    Alignment and phylogenetic trees may be opened and visualized by software capable of handling Newick and FASTA file formats.

11. Data from: A simple approach for maximizing the overlap of phylogenetic and comparative data

    • figshare.mq.edu.au
    • borealisdata.ca
    • +4more
    bin
    Updated May 30, 2023
    Cite
    Matthew W. Pennell; Richard G. FitzJohn; William K. Cornwell (2023). Data from: A simple approach for maximizing the overlap of phylogenetic and comparative data [Dataset]. http://doi.org/10.5061/dryad.5d3rq
    Explore at:
Available download formats: bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    Macquarie University
    Authors
    Matthew W. Pennell; Richard G. FitzJohn; William K. Cornwell
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

Biologists are increasingly using curated, public data sets to conduct phylogenetic comparative analyses. Unfortunately, there is often a mismatch between species for which there is phylogenetic data and those for which other data are available. As a result, researchers are commonly forced to either drop species from analyses entirely or else impute the missing data. A simple strategy to improve the overlap of phylogenetic and comparative data is to swap species in the tree that lack data with ‘phylogenetically equivalent’ species that have data. While this procedure is logically straightforward, it quickly becomes very challenging to do by hand. Here, we present algorithms that use topological and taxonomic information to maximize the number of swaps without altering the structure of the phylogeny. We have implemented our method in a new R package phyndr, which will allow researchers to apply our algorithm to empirical data sets. It is relatively efficient such that taxon swaps can be quickly computed, even for large trees. To facilitate the use of taxonomic knowledge, we created a separate data package taxonlookup; it contains a curated, versioned taxonomic lookup for land plants and is interoperable with phyndr. Emerging online databases and statistical advances are making it possible for researchers to investigate evolutionary questions at unprecedented scales. However, in this effort, species mismatch among data sources will increasingly be a problem; evolutionary informatics tools, such as phyndr and taxonlookup, can help alleviate this issue.

Usage Notes: Land plant taxonomic lookup table. This dataset is a stable version (version 1.0.1) of the dataset contained in the taxonlookup R package (see https://github.com/traitecoevo/taxonlookup for the most recent version). It contains a taxonomic reference table for 16,913 genera of land plants along with the number of recognized species in each genus. File: plant_lookup.csv

12. ‘🍷 Alcohol vs Life Expectancy’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Dec 24, 2016
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2016). ‘🍷 Alcohol vs Life Expectancy’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-alcohol-vs-life-expectancy-bdda/590be6d0/?iid=002-384&v=presentation
    Explore at:
    Dataset updated
    Dec 24, 2016
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘🍷 Alcohol vs Life Expectancy’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/alcohol-vs-life-expectancye on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Credit: This dataset was created by Jonathan Ortiz! All credits for the original go to the original author!

    About this dataset

    Intro

    2016">https://data.world/uncc-dsba/dsba-6100-fall-2016

    Findings

    There is a surprising relationship between alcohol consumption and life expectancy. In fact, the data suggest that life expectancy and alcohol consumption are positively correlated - 1.2 additional years for every 1 liter of alcohol consumed annually. This is, of course, a spurious finding, because the correlation of this relationship is very low - 0.28. This indicates that other factors in those countries where alcohol consumption is comparatively high or low are contributing to differences in life expectancy, and further analysis is warranted.

Plot: LifeExpectancy_v_AlcoholConsumption_Plot.jpg (https://data.world/api/databeats/dataset/alcohol-vs-life-expectancy/file/raw/LifeExpectancy_v_AlcoholConsumption_Plot.jpg)

    Methods

    1. Addressing Missing Values

The original drinks.csv file in the UNCC/DSBA-6100 dataset was missing values for The Bahamas, Denmark, and Macedonia for the wine, spirits, and beer attributes, respectively. Drinks_solution.csv shows these values filled in, using the mean of the rest of the column in each case.

    Other methods were considered and ruled out:

    • Deleting the Bahamas, Denmark, and Macedonia instances altogether - This is a possible route, but the data file itself is just under 200 rows, and there is only one observation for each country. Because the dataset is relatively small by number of instances, removal should be avoided in order to give the model more data to use.
    • Imputing missing values with k-Nearest Neighbors - Another possible route, knn impute can yield higher accuracy in certain cases when the dataset is fairly large. However, this particular dataset only contains 3 attributes, all of which seem unrelated to each other. If we had more columns with more data like availability, annual sales, preferences, etc. of the different drinks, it would be possible to predict these values with knn, but this approach should be avoided given the data we have.
    • Filling missing values with a MODE - By visualizing the data, it is easy to see that each column is fairly skewed, with many countries reporting 0 in one or more of the servings columns. Using the MODE would fill these missing entries with 0 for all three (beer_servings, spirit_servings, and wine_servings), and upon reviewing the Bahamas, Denmark, and Macedonia more closely, it is apparent that 0 would be a poor choice for the missing values, as all three countries clearly consume alcohol.
    • Filling missing values with MEDIAN - Due to the skewness mentioned just above in the MODE section, using a MEDIAN of the whole column would also be a poor choice, as the MEDIAN is pulled down by several countries reporting 0 or 1. A MEDIAN of only the observations reporting 1 or more servings--or another cutoff--could be used, however, and this would be acceptable.

Filling missing values with MEAN - In the case of the drinks dataset, this is the best approach. The MEAN averages for the columns happen to be very close to the actual data from where we sourced this exercise. In addition, the MEAN will not skew the data, which the prior approaches would.
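A sketch of that choice in pandas, using the column names from drinks.csv:

import pandas as pd

drinks = pd.read_csv("drinks.csv")
for col in ["beer_servings", "spirit_servings", "wine_servings"]:
    drinks[col] = drinks[col].fillna(drinks[col].mean())   # column mean fills the gaps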

    2. Calculating New Attributes

    The original drinks.csv dataset also had an empty data column: total_litres_of_pure_alcohol. This column needed to be calculated in order to do a simple 2D plot and trendline. It would have been possible to instead run a multi-variable regression on the data and therefore skip this step, but this adds an extra layer of complication to understanding the analysis - not to mention the point of the exercise is to go through an example of calculating new attributes (or "feature engineering") using domain knowledge.

    The graphic found at the Wikipedia / Standard Drink page shows the following breakdown:

    • Beer - 12 fl oz per serving - 5% average ABV
• Wine - 5 fl oz - 12% ABV
    • Spirits - 1.5 fl oz - 40% ABV

    The conversion factor from fl oz to L is 1 fl oz : 0.0295735 L

Therefore, the following formula was used to compute the empty column:

total_litres_of_pure_alcohol = (beer_servings * 12 fl oz per serving * 0.05 ABV + spirit_servings * 1.5 fl oz * 0.4 ABV + wine_servings * 5 fl oz * 0.12 ABV) * 0.0295735 liters per fl oz
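Continuing the snippet above, the stated formula as a pandas expression (0.0295735 litres per fluid ounce):

drinks["total_litres_of_pure_alcohol"] = (
    drinks["beer_servings"] * 12 * 0.05
    + drinks["spirit_servings"] * 1.5 * 0.40
    + drinks["wine_servings"] * 5 * 0.12
) * 0.0295735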

    3. Joining To External Data

    The lifeexpectancy.csv datafile in the https://data.world/uncc-dsba/dsba-6100-fall-2016 dataset contains life expectancy data for each country. The following query will join this data to the cleaned drinks.csv data file:

# Life Expectancy vs. Alcohol Consumption with countryTable

    PREFIX drinks: <http://data.world/databeats/alcohol-vs-life-expectancy/drinks_solution.csv/drinks_solution#>
    PREFIX life: <http://data.world/uncc-dsba/dsba-6100-fall-2016/lifeexpectancy.csv/lifeexpectancy#>
    PREFIX countries: <http://data.world/databeats/alcohol-vs-life-expectancy/countryTable.csv/countryTable#>
    
    SELECT ?country ?alc ?years
    WHERE {
      SERVICE <https://query.data.world/sparql/databeats/alcohol-vs-life-expectancy> {
        ?r1 drinks:total_litres_of_pure_alcohol ?alc .
        ?r1 drinks:country ?country .
        ?r2 countries:drinksCountry ?country .
        ?r2 countries:leCountry ?leCountry .
      }
    
      SERVICE <https://query.data.world/sparql/uncc-dsba/dsba-6100-fall-2016> {
        ?r3 life:CountryDisplay ?leCountry .
        ?r3 life:GhoCode ?gho_code .
        ?r3 life:Numeric ?years .
        ?r3 life:YearCode ?reporting_year .
        ?r3 life:SexDisplay ?sex .
      }
    FILTER ( ?gho_code = "WHOSIS_000001" && ?reporting_year = 2013 && ?sex = "Both sexes" )
    }
    ORDER BY ?country
    

    4. Plotting

    The resulting joined data can then be saved to local disk and imported into any analysis tool like Excel, Numbers, R, etc. to make a simple scatterplot. A trendline and R^2 should be added to determine the relationship between Alcohol Consumption and Life Expectancy (if any).

Plot: LifeExpectancy_v_AlcoholConsumption_Plot.jpg (https://data.world/api/databeats/dataset/alcohol-vs-life-expectancy/file/raw/LifeExpectancy_v_AlcoholConsumption_Plot.jpg)

This dataset was created by Jonathan Ortiz and contains around 200 samples along with Beer Servings, Spirit Servings, technical information, and other features such as Total Litres Of Pure Alcohol, Wine Servings, and more.

    How to use this dataset

    • Analyze Beer Servings in relation to Spirit Servings
    • Study the influence of Total Litres Of Pure Alcohol on Wine Servings

    Acknowledgements

    If you use this dataset in your research, please credit Jonathan Ortiz


    --- Original source retains full ownership of the source dataset ---

13. Replication Data for: lab package

    • dataverse.lib.nycu.edu.tw
    Updated Jul 11, 2023
    Cite
    NYCU Dataverse (2023). Replication Data for: lab package [Dataset]. http://doi.org/10.57770/IOWGLV
    Explore at:
Available download formats: bin, type/x-r-syntax, text/plain, text/markdown, png, jpeg, svg, js, html, css, xml, application/x-rlang-transport
    Dataset updated
    Jul 11, 2023
    Dataset provided by
    NYCU Dataverse
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

The R code for the lab package, synced from https://github.com/DHLab-TSENG/lab/. The proposed open-source lab package is a software tool that helps users explore and process laboratory data in electronic health records (EHRs). With the lab package, researchers can easily map local laboratory codes to a universal standard, mark abnormal results, summarize data using descriptive statistics, impute missing values, and generate analysis-ready data.

14. Associations between number of incidents or repetitions and other variables in CSEW, and the imputed synthetic dataset

    • plos.figshare.com
    xls
    Updated Jan 14, 2025
    Cite
    Estela Capelas Barbosa; Niels Blom; Annie Bunce (2025). Associations between number of incidents or repetitions and other variables in CSEW, and the imputed synthetic dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0301155.t005
    Explore at:
Available download formats: xls
    Dataset updated
    Jan 14, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Estela Capelas Barbosa; Niels Blom; Annie Bunce
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Associations between number of incidents or repetitions and other variables in CSEW, and the imputed synthetic dataset.

15. Turnover of the Radio Broadcasting Industry in Europe

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 17, 2024
    Cite
    Daniel Antal (2024). Turnover of the Radio Broadcasting Industry in Europe [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5651179
    Explore at:
    Dataset updated
    Jul 17, 2024
    Dataset authored and provided by
    Daniel Antal
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Imputed and forecasted values of the radio broadcasting industry from the Annual detailed enterprise statistics for services (NACE Rev. 2 H-N and S95) Eurostat folder.

We use backcasting, forecasting, approximation, last observation carried forward (LOCF) and next observation carried backward (NOCB) to impute missing values and to create realistic forecasts up to three periods ahead.
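A toy sketch of the interpolation and carry-forward/backward steps in pandas (the values are invented; the Eurostat-specific backcasting and forecasting steps are out of scope here):

import pandas as pd

turnover = pd.Series([1200.0, None, 1350.0, None, None],
                     index=[2015, 2016, 2017, 2018, 2019])
filled = (turnover.interpolate(limit_area="inside")  # approximation between observations
                  .ffill()                           # last observation carried forward
                  .bfill())                          # next observation carried backward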

Compared to the Eurostat raw data we added value with:

• Increased number of observations: 65%
• Reduced missing values: -48.1%
• Increased non-missing subset for regression or AI: +66.67%

16. Data from: Daily and Annual NO2 Concentrations for the Contiguous United States, 1-km Grids, Version 1.10 (2000-2016)

    • catalog.data.gov
    • s.cnmilf.com
    • +3more
    Updated Apr 24, 2025
    + more versions
    Cite
    SEDAC (2025). Daily and Annual NO2 Concentrations for the Contiguous United States, 1-km Grids, Version 1.10 (2000-2016) [Dataset]. https://catalog.data.gov/dataset/daily-and-annual-no2-concentrations-for-the-contiguous-united-states-1-km-grids-versi-2000
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    SEDAC
    Area covered
    Contiguous United States, United States
    Description

The Daily and Annual NO2 Concentrations for the Contiguous United States, 1-km Grids, Version 1.10 (2000-2016) data set contains daily predictions of Nitrogen Dioxide (NO2) concentrations at a high resolution (1-km grid cells) for the years 2000 to 2016. An ensemble modeling framework was used to assess NO2 levels with high accuracy, which combined estimates from three machine learning models (neural network, random forest, and gradient boosting) with a generalized additive model. Predictor variables included NO2 column concentrations from satellites, land-use variables, meteorological variables, predictions from two chemical transport models, GEOS-Chem and the U.S. Environmental Protection Agency (EPA) CommUnity Multiscale Air Quality Modeling System (CMAQ), along with other ancillary variables. The annual predictions were calculated by averaging the daily predictions for each year in each grid cell. The ensemble produced a cross-validated R-squared value of 0.79 overall, a spatial R-squared value of 0.84, and a temporal R-squared value of 0.73. In version 1.10, the completeness of daily NO2 predictions has been enhanced by employing linear interpolation to impute missing values. Specifically, for days with small spatial patches of missing data with less than 100 grid cells, inverse distance weighting interpolation was used to fill the missing grid cells. Other missing daily NO2 predictions were interpolated from the nearest days with available data. Annual predictions were updated by averaging the imputed daily predictions for each year in each grid cell. These daily and annual NO2 predictions allow public health researchers to respectively estimate the short- and long-term effects of NO2 exposures on human health, supporting the U.S. EPA for the revision of the National Ambient Air Quality Standards for daily average and annual average concentrations of NO2. The data are available in RDS and GeoTIFF formats for statistical research and geospatial analysis.
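A sketch of the spatial step on a single day's grid (our simplification: the patch-size test is reduced to a count of missing cells, and the nearest-day temporal step is omitted):

import numpy as np

def idw_fill(grid, max_missing=100, power=2):
    missing = np.isnan(grid)
    if missing.sum() >= max_missing:
        return grid                                   # defer to temporal interpolation
    oy, ox = np.nonzero(~missing)
    out = grid.copy()
    for y, x in zip(*np.nonzero(missing)):
        w = 1.0 / np.hypot(oy - y, ox - x) ** power   # inverse-distance weights
        out[y, x] = np.sum(w * grid[oy, ox]) / w.sum()
    return out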

  17. South African Census 2001, CASASP imputed data - South Africa

    • microdata.worldbank.org
    • catalog.ihsn.org
    • +1more
    Updated Sep 2, 2014
    + more versions
    Cite
    Statistics South Africa (2014). South African Census 2001, CASASP imputed data - South Africa [Dataset]. https://microdata.worldbank.org/index.php/catalog/1272
    Explore at:
    Dataset updated
    Sep 2, 2014
    Dataset provided by
Statistics South Africa (http://www.statssa.gov.za/)
    Centre for the Analysis of South African Social Policy
    Time period covered
    2001
    Area covered
    South Africa
    Description

    Abstract

    This dataset includes imputation for missing data in key variables in the ten percent sample of the 2001 South African Census. Researchers at the Centre for the Analysis of South African Social Policy (CASASP) at the University of Oxford used sequential multiple regression techniques to impute income, education, age, gender, population group, occupation and employment status in the dataset. The main focus of the work was to impute income where it was missing or recorded as zero. The imputed results are similar to previous imputation work on the 2001 South African Census, including the single ‘hot-deck’ imputation carried out by Statistics South Africa.
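Generic chained-equations imputation in the same spirit as the sequential regression approach described above (scikit-learn's IterativeImputer, not the actual CASASP procedure; the variables are stand-ins):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))           # stand-ins for income, education, age, ...
X[rng.random(X.shape) < 0.1] = np.nan    # illustrative 10% missingness
X_imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)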

    Kind of data

    Sample survey data [ssd]

    Mode of data collection

    Face-to-face [f2f]

18. Data from: Daily and Annual PM2.5 Concentrations for the Contiguous United States, 1-km Grids, Version 1.10 (2000-2016)

    • catalog.data.gov
    • datasets.ai
    • +3more
    Updated Apr 24, 2025
    + more versions
    Cite
    SEDAC (2025). Daily and Annual PM2.5 Concentrations for the Contiguous United States, 1-km Grids, Version 1.10 (2000-2016) [Dataset]. https://catalog.data.gov/dataset/daily-and-annual-pm2-5-concentrations-for-the-contiguous-united-states-1-km-grids-ver-2000
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    SEDAC
    Area covered
    Contiguous United States, United States
    Description

The Daily and Annual PM2.5 Concentrations for the Contiguous United States, 1-km Grids, Version 1.10 (2000-2016) data set includes predictions of PM2.5 concentration in grid cells at a resolution of 1-km for the years 2000-2016. A generalized additive model that accounted for geographic difference was used to ensemble daily predictions of three machine learning models: neural network, random forest, and gradient boosting. The three machine learners incorporated multiple predictors, including satellite data, meteorological variables, land-use variables, elevation, chemical transport model predictions, several reanalysis data sets, and others. The annual predictions were calculated by averaging the daily predictions for each year in each grid cell. The ensembled model demonstrated better predictive performance than the individual machine learners, with 10-fold cross-validated R-squared values of 0.86 for daily predictions and 0.89 for annual predictions. In version 1.10, the completeness of daily PM2.5 predictions has been enhanced by employing linear interpolation to impute missing values. Specifically, for days with small spatial patches of missing data with less than 100 grid cells, inverse distance weighting interpolation was used to fill the missing grid cells. Other missing daily PM2.5 predictions were interpolated from the nearest days with available data. Annual predictions were updated by averaging the imputed daily predictions for each year in each grid cell. These daily and annual PM2.5 predictions allow public health researchers to respectively estimate the short- and long-term effects of PM2.5 exposures on human health, supporting the U.S. Environmental Protection Agency (EPA) for the revision of the National Ambient Air Quality Standards for 24-hour average and annual average concentrations of PM2.5. The data are available in RDS and GeoTIFF formats for statistical research and geospatial analysis.

19. funspace: an R package to build, analyze and plot functional trait spaces

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Feb 29, 2024
    Cite
    Carlos Perez Carmona; Nicola Pavanetto; Giacomo Puglielli (2024). funspace: an R package to build, analyze and plot functional trait spaces [Dataset]. http://doi.org/10.5061/dryad.4tmpg4fg6
    Explore at:
    Dataset updated
    Feb 29, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Carlos Perez Carmona; Nicola Pavanetto; Giacomo Puglielli
    Time period covered
    Jan 1, 2023
    Description

Functional trait space analyses are pivotal to describe and compare organisms’ functional diversity across the tree of life. Yet, there is no single application that streamlines the many sometimes-troublesome steps needed to build and analyze functional trait spaces. To fill this gap, we propose funspace, an R package to easily handle bivariate and multivariate (PCA-based) functional trait space analyses. The six functions that constitute the package can be grouped in three modules: ‘Building and exploring’, ‘Mapping’, and ‘Plotting’. The building and exploring module defines the main features of a functional trait space (e.g., functional diversity metrics) by leveraging kernel density-based methods. The mapping module uses general additive models to map how a target variable distributes within a trait space. The plotting module provides many options for creating flexible and high-quality figures representing the outputs obtained from previous modules. We provide a worked example to dem...

funspace - Creating and representing functional trait spaces

    Estimation of functional spaces based on traits of organisms. The package includes functions to impute missing trait values (with or without considering phylogenetic information), and to create, represent and analyse two dimensional functional spaces based on principal components analysis, other ordination methods, or raw traits. It also allows for mapping a third variable onto the functional space.

    Description of the Data and file structure

We provide the package as a .tar file (filename: funspace_0.1.1.tar). Once the package has been downloaded, it can be installed directly in R from Packages >> Install >> Install from >> Package Archive File (.zip, .tar.gz). All the functions and example datasets included in funspace that are necessary to reproduce the worked example in the paper will be automatically loaded. Functions and example datasets can then be accessed using the standard syntax fu...

20. Associations between age and other variables in RCEW data, CSEW data, and the imputed synthetic dataset

    • plos.figshare.com
    xls
    Updated Jan 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Estela Capelas Barbosa; Niels Blom; Annie Bunce (2025). Associations between age and other variables in RCEW data, CSEW data, and the imputed synthetic dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0301155.t002
    Explore at:
Available download formats: xls
    Dataset updated
    Jan 14, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Estela Capelas Barbosa; Niels Blom; Annie Bunce
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Associations between age and other variables in RCEW data, CSEW data, and the imputed synthetic dataset.
