https://www.icpsr.umich.edu/web/ICPSR/studies/36379/terms
This study was an evaluation of multiple imputation strategies to address missing data using the New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 (ICPSR 20060) dataset.
Missing data are a growing concern in social science research. This paper introduces machine-learning methods to improve imputation efficiency and examines their effect on analyses with missing data, using Internet penetration and public service data as the test case. The empirical results show that the method not only confirms the robustness of the positive impact of Internet penetration on public services, but also demonstrates that machine-learning imputation outperforms random and multiple imputation, greatly improving the model's explanatory power. After machine-learning imputation, the panel data show better continuity in the time trend and can feasibly be analyzed with a dynamic panel model. The long-term effects of the Internet on public services are found to be significantly stronger than the short-term effects. Finally, some mechanisms behind the empirical results are discussed.
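The abstract does not specify the authors' exact machine-learning model, so the following is only a minimal sketch of the general approach, under stated assumptions: scikit-learn's IterativeImputer with a random forest stands in for "the machine-learning imputation method," and the panel columns are hypothetical.

```python
# Minimal sketch: machine-learning imputation for a panel, assuming a
# random forest inside scikit-learn's IterativeImputer. Column names
# (internet_penetration, public_service_index) are hypothetical.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical region-year panel with some outcomes missing at random.
panel = pd.DataFrame({
    "year": np.tile(np.arange(2010, 2020), 30),
    "internet_penetration": rng.uniform(0.2, 0.9, 300),
    "public_service_index": rng.normal(50, 10, 300),
})
mask = rng.random(300) < 0.15
panel.loc[mask, "public_service_index"] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0,
)
completed = pd.DataFrame(imputer.fit_transform(panel), columns=panel.columns)
```

In a real application, region identifiers and lagged outcomes would typically enter the imputation model as well, so that the time-trend continuity the abstract emphasizes is preserved.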
https://www.icpsr.umich.edu/web/ICPSR/studies/20060/terms
The purpose of the project was to learn more about patterns of homicide in the United States by strengthening the ability to make imputations for Supplementary Homicide Report (SHR) data with missing values. Supplementary Homicide Reports (SHR) and local police data from Chicago, Illinois, St. Louis, Missouri, Philadelphia, Pennsylvania, and Phoenix, Arizona, for 1990 to 1995 were merged to create a master file by linking on overlapping information on victim and incident characteristics. Through this process, 96 percent of the cases in the SHR were matched with cases in the police files. The data contain variables for three types of cases: complete in SHR, missing offender and incident information in SHR but known in police report, and missing offender and incident information in both. The merged file allows estimation of similarities and differences between the cases with known offender characteristics in the SHR and those in the other two categories. The accuracy of existing data imputation methods can be assessed by comparing imputed values in an "incomplete" dataset (the SHR), generated by the three imputation strategies discussed in the literature, with the actual values in a known "complete" dataset (combined SHR and police data). Variables from both the Supplemental Homicide Reports and the additional police report offense data include incident date, victim characteristics, offender characteristics, incident details, geographic information, as well as variables regarding the matching procedure.
Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.
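The Hit-and-Run step is the part that guarantees feasibility, and it can be sketched in isolation. The example below targets a uniform distribution on a polytope {x : Ax <= b} rather than the paper's mixture-model posterior, and the constraint matrix is illustrative; the point is that proposals move only along the feasible segment of a random direction, so every draw satisfies the constraints.

```python
# Minimal Hit-and-Run sketch over {x : A x <= b}. The paper samples
# missing items from their posterior under a DP mixture; here the target
# is uniform on the constraint set, which isolates the mechanism that
# keeps every draw feasible. A and b below are illustrative.
import numpy as np

def hit_and_run(A, b, x0, n_samples, rng):
    """Uniform Hit-and-Run over {x : A x <= b}, starting from feasible x0."""
    x = x0.astype(float)
    out = []
    for _ in range(n_samples):
        d = rng.normal(size=x.size)
        d /= np.linalg.norm(d)
        # Feasible segment: A(x + t d) <= b  =>  t * (A d) <= b - A x.
        ad, slack = A @ d, b - A @ x
        lo = max((s / a for a, s in zip(ad, slack) if a < 0), default=-np.inf)
        hi = min((s / a for a, s in zip(ad, slack) if a > 0), default=np.inf)
        x = x + rng.uniform(lo, hi) * d
        out.append(x.copy())
    return np.array(out)

rng = np.random.default_rng(1)
# Unit box 0 <= x1, x2 <= 1 written as A x <= b.
A = np.array([[1.0, 0], [-1, 0], [0, 1], [0, -1]])
b = np.array([1.0, 0, 1, 0])
draws = hit_and_run(A, b, np.array([0.5, 0.5]), 1000, rng)
```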
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating the development of effective normalization and imputation methods. The selection of appropriate normalization and imputation methods is inherently data-specific, and choosing the optimal approach from the available options is critical for ensuring robust downstream analysis. This study aimed to identify the most suitable combination of these methods for quality control and accurate identification of differentially expressed proteins. In this study, we developed nine combinations by integrating three normalization methods, locally weighted linear regression (LOESS), variance stabilization normalization (VSN), and robust linear regression (RLR) with three imputation methods: k-nearest neighbors (k-NN), local least-squares (LLS), and singular value decomposition (SVD). We utilized statistical measures, including the pooled coefficient of variation (PCV), pooled estimate of variance (PEV), and pooled median absolute deviation (PMAD), to assess intragroup and intergroup variation. The combinations yielding the lowest values corresponding to each statistical measure were chosen as the data set’s suitable normalization and imputation methods. The performance of this approach was tested using two spiked-in standard label-free proteomics benchmark data sets. The identified combinations returned a low NRMSE and showed better performance in identifying spiked-in proteins. The developed approach can be accessed through the R package named ’lfproQC’ and a user-friendly Shiny web application (https://dabiniasri.shinyapps.io/lfproQC and http://omics.icar.gov.in/lfproQC), making it a valuable resource for researchers looking to apply this method to their data sets.
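The lfproQC package's own functions are not reproduced here. As a rough illustration of one of the nine combinations, the sketch below pairs k-NN imputation with a pooled coefficient of variation (PCV) check; the PCV definition used (average per-protein CV within a replicate group) and the simulated data are assumptions.

```python
# Sketch of one normalization/imputation combination of the kind the
# paper compares: k-NN imputation followed by a pooled coefficient of
# variation (PCV) check. Data are simulated; the PCV below is one common
# definition, not necessarily the package's exact formula.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(2)
# Hypothetical log2 intensities: 500 proteins x 6 runs (two groups of 3).
X = pd.DataFrame(rng.normal(20, 2, (500, 6)),
                 columns=["A1", "A2", "A3", "B1", "B2", "B3"])
X = X.mask(rng.random(X.shape) < 0.1)  # typical label-free missingness

imputed = pd.DataFrame(KNNImputer(n_neighbors=10).fit_transform(X),
                       columns=X.columns)

def pcv(df):
    """Pooled CV: average of per-protein sd/mean across a group's runs."""
    return (df.std(axis=1) / df.mean(axis=1)).mean()

print("PCV group A:", pcv(imputed[["A1", "A2", "A3"]]))
print("PCV group B:", pcv(imputed[["B1", "B2", "B3"]]))
```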
https://www.icpsr.umich.edu/web/ICPSR/studies/24801/terms
These data provide incident-level information on criminal homicides including location, circumstances, and method of offense, as well as demographic characteristics of victims and perpetrators and the relationship between the two. To adjust for unit missingness, a multiple imputation approach and a weighting scheme were adopted, resulting in a fully-imputed SHR cumulative database of criminal homicides for the years 1976-2007. Unlike other versions of the SHR files, these are limited to incidents of murder and non-negligent manslaughter, excluding justifiable homicides, negligent manslaughter and homicides related to the September 11, 2001, terrorist attacks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multiple imputation (MI) has become a standard statistical technique for dealing with missing values. The CDC Anthrax Vaccine Research Program (AVRP) dataset created new challenges for MI due to the large number of variables of different types and the limited sample size. A common method for imputing missing data in such complex studies is to specify, for each of J variables with missing values, a univariate conditional distribution given all other variables, and then to draw imputations by iterating over the J conditional distributions. Such fully conditional imputation strategies have the theoretical drawback that the conditional distributions may be incompatible. When the missingness pattern is monotone, a theoretically valid approach is to specify, for each variable with missing values, a conditional distribution given the variables with fewer or the same number of missing values and sequentially draw from these distributions. In this article, we propose the “multiple imputation by ordered monotone blocks” approach, which combines these two basic approaches by decomposing any missingness pattern into a collection of smaller “constructed” monotone missingness patterns, and iterating. We apply this strategy to impute the missing data in the AVRP interim data. Supplemental materials, including all source code and a synthetic example dataset, are available online.
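The full ordered-monotone-blocks decomposition is beyond a short example, but its building block can be sketched: under a monotone pattern, each variable is imputed in order from the variables with fewer missing values, drawing from a fitted conditional model rather than plugging in predictions. Linear models stand in for whatever conditionals an analyst would specify, and the data are simulated.

```python
# Sketch of the monotone building block: impute variables in order of
# missingness, drawing from fitted conditionals. A full MI step would
# also draw the model parameters from their posterior; that is omitted
# here for brevity.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({"x1": rng.normal(size=n)})
df["x2"] = 0.8 * df["x1"] + rng.normal(scale=0.6, size=n)
df["x3"] = 0.5 * df["x2"] + rng.normal(scale=0.6, size=n)
# Monotone pattern: whenever x2 is missing, x3 is missing too.
df.loc[rng.random(n) < 0.2, "x3"] = np.nan
df.loc[rng.random(n) < 0.1, ["x2", "x3"]] = np.nan

for col, predictors in [("x2", ["x1"]), ("x3", ["x1", "x2"])]:
    obs = df[col].notna()
    model = LinearRegression().fit(df.loc[obs, predictors], df.loc[obs, col])
    resid_sd = np.std(df.loc[obs, col] - model.predict(df.loc[obs, predictors]))
    miss = ~obs
    draws = model.predict(df.loc[miss, predictors])
    df.loc[miss, col] = draws + rng.normal(scale=resid_sd, size=miss.sum())
```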
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data are an inevitable aspect of empirical research. Researchers have developed numerous techniques for handling missing data to avoid information loss and bias. Over the past 50 years, these methods have become more efficient but also more complex. Building on previous review studies, this paper analyzes which missing data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016; JSTOR provided the data in text format. We applied a text-mining approach to extract the necessary information from the corpus. Our results show that the use of advanced missing data handling methods such as Multiple Imputation and Full Information Maximum Likelihood estimation grew steadily over the examination period, while simpler methods, like listwise and pairwise deletion, remain in widespread use.
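As a toy version of the text-mining step (the paper's actual JSTOR pipeline is not reproduced here), one can scan article text for method mentions with regular expressions; the patterns and the two sample "articles" below are made up.

```python
# Toy illustration of mining a corpus for missing-data method mentions.
# Patterns and sample texts are illustrative only.
import re

METHOD_PATTERNS = {
    "multiple imputation": r"\bmultiple imputation\b",
    "FIML": r"\bfull information maximum likelihood\b|\bFIML\b",
    "listwise deletion": r"\blistwise deletion\b|\bcomplete[- ]case\b",
    "pairwise deletion": r"\bpairwise deletion\b",
}

def methods_mentioned(text):
    return {name for name, pat in METHOD_PATTERNS.items()
            if re.search(pat, text, flags=re.IGNORECASE)}

corpus = [
    "Missing values were handled with multiple imputation (m = 20).",
    "We used listwise deletion, then compared results with FIML.",
]
counts = {}
for doc in corpus:
    for m in methods_mentioned(doc):
        counts[m] = counts.get(m, 0) + 1
print(counts)
```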
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performances usually rely on the proper specification of imputation models, and this requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. In this article, we propose a scalable MI framework mixgb, which is based on XGBoost, subsampling, and predictive mean matching. Our approach leverages the power of XGBoost, a fast implementation of gradient boosted trees, to automatically capture interactions and nonlinear relations while achieving high computational efficiency. In addition, we incorporate subsampling and predictive mean matching to reduce bias and to better account for appropriate imputation variability. The proposed framework is implemented in an R package mixgb. Supplementary materials for this article are available online.
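The mixgb R package itself is not reproduced here. The sketch below shows the recipe the abstract names, with scikit-learn's GradientBoostingRegressor (which supports subsampling) standing in for XGBoost, followed by predictive mean matching: each missing value receives an observed donor value whose fitted mean is among the closest.

```python
# Sketch of boosted trees + subsampling + predictive mean matching (PMM).
# GradientBoostingRegressor stands in for XGBoost; data are simulated.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = np.sin(df["x1"]) + 0.5 * df["x2"] ** 2 + rng.normal(scale=0.3, size=n)
df.loc[rng.random(n) < 0.25, "y"] = np.nan

obs, miss = df["y"].notna(), df["y"].isna()
model = GradientBoostingRegressor(subsample=0.8, random_state=0)
model.fit(df.loc[obs, ["x1", "x2"]], df.loc[obs, "y"])

pred_obs = model.predict(df.loc[obs, ["x1", "x2"]])
pred_miss = model.predict(df.loc[miss, ["x1", "x2"]])

k = 5  # donors per missing value, as in classic PMM
imputed = []
for p in pred_miss:
    donors = np.argsort(np.abs(pred_obs - p))[:k]   # k nearest fitted means
    imputed.append(df.loc[obs, "y"].to_numpy()[rng.choice(donors)])
df.loc[miss, "y"] = imputed
```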
Background: Genotype imputation is a critical preprocessing step in genome-wide association studies (GWAS), enhancing statistical power for detecting associated single nucleotide polymorphisms (SNPs) by increasing marker size. Results: In response to the needs of researchers seeking user-friendly graphical tools for imputation without requiring informatics or computer expertise, we have developed weIMPUTE, a web-based imputation graphical user interface (GUI). Unlike existing genotype imputation software, weIMPUTE supports multiple imputation packages, including SHAPEIT, Eagle, Minimac4, Beagle, and IMPUTE2, while encompassing the entire workflow, from quality control to data format conversion. This comprehensive platform enables both novices and experienced users to readily perform imputation tasks. For reference genotype data owners, weIMPUTE can be installed on a server or workstation, facilitating web-based imputation services without data sharing. Conclusion: weIMPUTE represents a versatile imputation solution for researchers across various fields, offering the flexibility to create personalized imputation servers on different operating systems.
The findhap.f90 program finds haplotypes and imputes genotypes using multiple chip sets and sequence data. Program and download information can be found at the Animal Improvement Program (AIP) web site: http://aipl.arsusda.gov/software/findhap
Downloads:
- Version 4 program, example files, and executable (beta version; not quite ready for routine use on U.S. chip data, but performs better than version 3 for sequence data).
- Example data files for the imputation study presented by VanRaden and Sun at the 2014 World Congress on Genetics Applied to Livestock Production. Files include actual pedigree, simulated true genotypes, simulated sequence reads, and imputed genotypes. This example used 500 reference bulls sequenced at 4× with 1% error and containing high-density SNPs; the 250 young bulls used to test imputation had only high-density SNPs. Other examples in the study can be generated by setting other options for programs findhap4, geno2seq, and genosim.
Resources in this dataset: Resource Title: FINDHAP. File Name: Web Page. URL: https://www.ars.usda.gov/research/software/download/?softwareid=494&modecode=80-42-05-30 (download page)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A sequential regression or chained equations imputation approach uses a Gibbs sampling-type iterative algorithm that imputes the missing values using a sequence of conditional regression models. It is a flexible approach for handling different types of variables and complex data structures. Many simulation studies have shown that the multiple imputation inferences based on this procedure have desirable repeated sampling properties. However, a theoretical weakness of this approach is that the specification of a set of conditional regression models may not be compatible with a joint distribution of the variables being imputed. Hence, the convergence properties of the iterative algorithm are not well understood. This article develops conditions for convergence and assesses the properties of inferences from both compatible and incompatible sequences of regression models. The results are established for the missing data pattern where each subject may be missing a value on at most one variable. The sequence of regression models is assumed to be an empirically good fit for the data, chosen by the imputer based on appropriate model diagnostics. The results are used to develop criteria for the choice of regression models. Supplementary materials for this article are available online.
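A minimal chained-equations loop of the kind analyzed in the article, with linear models and normal draws as placeholders for the imputer's chosen conditionals, and a missingness pattern matching the article's setting (each subject missing at most one variable):

```python
# Minimal chained-equations (sequential regression) loop: initialize,
# then cycle through variables, re-imputing each from a regression on
# all the others. This is the Gibbs-type step whose convergence the
# article studies; models and data here are placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n, cols = 300, ["x1", "x2", "x3"]
data = pd.DataFrame(rng.multivariate_normal(
    [0.0, 0.0, 0.0],
    [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]], size=n),
    columns=cols)
# The article's pattern: each subject is missing at most one variable.
miss_col = rng.integers(0, 4, size=n)  # 0..2 -> that column missing; 3 -> none
for j, col in enumerate(cols):
    data.loc[miss_col == j, col] = np.nan

filled = data.fillna(data.mean())  # crude starting values
for _ in range(10):  # Gibbs-type cycles
    for col in cols:
        others = [c for c in cols if c != col]
        obs = data[col].notna()
        model = LinearRegression().fit(filled.loc[obs, others],
                                       filled.loc[obs, col])
        resid_sd = np.std(filled.loc[obs, col]
                          - model.predict(filled.loc[obs, others]))
        draws = model.predict(filled.loc[~obs, others])
        filled.loc[~obs, col] = draws + rng.normal(scale=resid_sd,
                                                   size=(~obs).sum())
```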
https://www.gesis.org/en/institute/data-usage-terms
Since the early stages of public opinion research, nonresponse has been identified as an important threat to the degree to which a sample can represent the population of interest. Researchers have documented a trend of declining response rates over the years. However, the nonresponse rate becomes a concern only when it introduces error or bias into survey results. One way to estimate nonresponse bias is through imputation. Online panels, which maintain a pool of respondents who are invited to participate in research through electronic means, face unique opportunities as well as challenges with regard to nonresponse and its imputation. Using data from a nation-wide online panel, this paper hypothesizes that nonresponse bias may exist due to common causes shared between response propensity and opinion placements. After testing for these common causes, imputations are made to estimate the missing values. Lastly, the differences between observed and imputed distributions on variables of interest are examined to show the scope of nonresponse bias. This paper finds that nonresponse biases may exist in online panels. First, the theoretical model of nonresponse bias was supported because the common-cause pattern was found in the dataset: response propensity and the opinion items of interest appeared to share common causes, mostly demographic variables. Second, the imputation analyses show that although most of the differences between imputed and measured opinions do not indicate serious biases, there were a few cases in which the differences appeared critical. The limitations of this study, especially those of the imputation method, are discussed at the end of the paper, along with suggestions for future research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the latest smartphones as of July 2024, gathered through web scraping using Selenium and Beautiful Soup. The dataset is available in four different versions, reflecting the stages of data cleaning and processing.
- If you want to know about the web scraping process, read the Medium article.
- If you want to see the step-by-step process of data cleaning and EDA, check out the GitHub repo.
Version 1: This version contains the fully uncleaned data as it was initially scraped from the web. It includes all the raw information, with inconsistencies, missing values, and potential duplicates. Purpose: serves as the baseline dataset for understanding the initial state of the data before any cleaning or processing.
Version 2: Basic cleaning operations have been applied, including removing duplicates, handling missing values, and standardizing the formats of certain fields (e.g., dates, numerical values). Purpose: provides a cleaner and more consistent dataset, making basic analysis easier.
Version 3: Additional data cleaning techniques have been implemented. This version addresses more complex issues such as outlier detection and correction, normalization of categorical data, and initial feature engineering. Purpose: offers a more refined dataset suitable for exploratory data analysis (EDA) and more in-depth statistical analyses.
Version 4: This version represents the final, fully cleaned dataset. Advanced cleaning techniques have been applied, including imputation of missing data, removal of irrelevant features, and final feature engineering. Purpose: ideal for machine learning model training and other advanced analytics.
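As a hedged illustration of that final stage (the dataset's actual column schema is not listed here), a pandas sketch with hypothetical columns:

```python
# Sketch of the final cleaning stage: impute missing numeric fields with
# the median, categorical fields with the mode, and drop a column judged
# irrelevant for modeling. All column names are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "brand": ["A", "B", None, "A"],
    "price_usd": [299.0, np.nan, 549.0, 199.0],
    "battery_mah": [5000, 4500, np.nan, 6000],
    "listing_url": ["u1", "u2", "u3", "u4"],  # irrelevant for modeling
})

df["brand"] = df["brand"].fillna(df["brand"].mode()[0])
for col in ["price_usd", "battery_mah"]:
    df[col] = df[col].fillna(df[col].median())
df = df.drop(columns=["listing_url"])
```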
https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).
The dataset includes:
- Category (Categorical): Product category (A, B, C, D)
- Price (Numerical): Randomized product prices
- Rating (Numerical): Ratings between 1 and 5
- Stock (Categorical): Availability status (In Stock, Out of Stock)
- Discount (Numerical): Discount percentage
This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
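A starting point for practicing on a dataset like this one might look as follows; the simulated frame mirrors the description above, and the pairing of missingness rates to columns, like the fill rules, is an assumption:

```python
# Simulate a frame matching the dataset card, then apply type-appropriate
# fills. Which missingness rate belongs to which column is assumed here.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 4362
df = pd.DataFrame({
    "Category": rng.choice(list("ABCD"), n),
    "Price": rng.uniform(5, 500, n).round(2),
    "Rating": rng.uniform(1, 5, n).round(1),
    "Stock": rng.choice(["In Stock", "Out of Stock"], n),
    "Discount": rng.uniform(0, 70, n).round(1),
})
for col, rate in zip(df.columns, [0.63, 0.04, 0.47, 0.31, 0.09]):
    df.loc[rng.random(n) < rate, col] = np.nan

df["Category"] = df["Category"].fillna(df["Category"].mode()[0])
df["Stock"] = df["Stock"].fillna("Out of Stock")  # one conservative choice
for col in ["Price", "Rating", "Discount"]:
    df[col] = df[col].fillna(df[col].median())
```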
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and R code used for the analyses reported in the publication: Coumoundouros et al., Cognitive behavioural therapy self-help intervention preferences among informal caregivers of adults with chronic kidney disease: an online cross-sectional survey. BMC Nephrology.
Summary of study
An online cross-sectional survey of informal caregivers (e.g., family and friends) of people living with chronic kidney disease in the United Kingdom. The study aimed to examine informal caregivers' cognitive behavioural therapy self-help intervention preferences and to describe the caregiving situation (e.g., types of care activities) and informal caregivers' mental health (depression, anxiety, and stress symptoms).
Participants were eligible to participate if they were at least 18 years old, lived in the United Kingdom, and provided unpaid care to someone living with chronic kidney disease who was at least 18 years old.
The online survey included questions regarding (1) informal caregivers' characteristics; (2) care recipients' characteristics; (3) intervention preferences (e.g., content, delivery format); and (4) informal caregivers' mental health. Mental health was assessed using the 21-item Depression, Anxiety, and Stress Scale (DASS-21), which is composed of three subscales measuring depression, anxiety, and stress, respectively.
Sixty-five individuals participated in the survey.
See the published article for full study details.
Description of uploaded files
1. ENTWINE_ESR14_Kidney Carer Survey Data_FULL_2022-08-30: Excel file with the complete, raw survey data. Note: the first half of participants' postal codes was collected; however, this data was removed from the uploaded dataset to ensure participant anonymity.
2. ENTWINE_ESR14_Kidney Carer Survey Data_Clean DASS-21 Data_2022-08-30: Excel file with cleaned data for the DASS-21 scale. Data cleaning involved imputing a missing value when a participant was missing exactly one item within a DASS-21 subscale: the missing item was replaced with the mean of the other items in that subscale (a minimal sketch of this rule appears after this list).
3. ENTWINE_ESR14_Kidney Carer Survey_KEY_2022-08-30: Excel file with key linking item labels in uploaded datasets with the corresponding survey question.
4. R Code for Kidney Carer Survey_2022-08-30: R file of R code used to analyse survey data.
5. R code for Kidney Carer Survey_PDF_2022-08-30: PDF file of R code used to analyse survey data.
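A minimal sketch of the DASS-21 cleaning rule applied in file 2, with an illustrative item-to-subscale mapping (the survey's actual item key is in file 3):

```python
# If a participant is missing exactly one item within a DASS-21 subscale,
# replace it with the mean of that participant's other items in the same
# subscale. The item-to-subscale mapping below is illustrative only.
import numpy as np
import pandas as pd

subscales = {
    "depression": ["d1", "d2", "d3", "d4", "d5", "d6", "d7"],
    "anxiety":    ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
    "stress":     ["s1", "s2", "s3", "s4", "s5", "s6", "s7"],
}

def impute_dass(df):
    df = df.copy()
    for items in subscales.values():
        block = df[items]
        one_missing = block.isna().sum(axis=1) == 1
        means = block.mean(axis=1)  # mean of the observed items
        for col in items:
            fix = one_missing & block[col].isna()
            df.loc[fix, col] = means[fix]
    return df
```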
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de434779
Abstract (en): The primary purpose of the State Nonfiscal Survey is to provide basic information on public elementary and secondary school students and staff for each of the 50 states, the District of Columbia, and outlying territories (American Samoa, Guam, Puerto Rico, the Virgin Islands, and the Marshall Islands). The database provides the following information on students and staff: general information (name, address, and telephone number of the state education agency), staffing information (number of FTEs on the instructional staff, guidance counselor staff, library staff, support staff, and administrative staff), and student information (membership counts by grade, counts of high school completers, counts of high school completers by racial/ethnic breakouts, and breakouts for dropouts by grade, sex, and race).

ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats, as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing step for this data collection: checked for undocumented or out-of-range codes.

The universe comprises all public elementary and secondary education agencies in the 50 states, the District of Columbia, United States territories (American Samoa, Guam, Puerto Rico, the Virgin Islands, and the Marshall Islands), and Department of Defense schools outside of the United States.

Processing notes: 2006-01-18: File DOC2450.ALL.PDF was removed from any previous datasets and flagged as a study-level file, so that it will accompany all downloads. 2006-01-18: File CB2450.ALL.PDF was removed from any previous datasets and flagged as a study-level file, so that it will accompany all downloads.

(1) Part 2, Imputed Data, is a different version of the data in Part 1, Reported Data. The National Center for Education Statistics (NCES) imputed and adjusted some reported values in order to create a data file (Part 2) that more accurately reflects student and staff counts and improves comparability between states. Imputations are defined as cases where the missing value is not reported at all, indicating that subtotals for the category are under-reported. An imputation by NCES assigns a value to the missing item, and the subtotals containing this item increase by the amount of the imputation. Imputations and adjustments were performed on the 50 states and Washington, DC, only. Since all states and Washington, DC, reported data in this survey, these imputations and adjustments were implemented to correct for item nonresponse only. This process consisted of several stages and steps and varied with the nature of the missing data. No adjustments or imputations were made to high school graduates or other high school completer categories, nor were any adjustments or imputations performed on the race/ethnicity data. (2) The Instruction Manual included with this data collection also applies to COMMON CORE OF DATA: PUBLIC EDUCATION AGENCY UNIVERSE, 1995-1996 (ICPSR 2468) and COMMON CORE OF DATA: PUBLIC SCHOOL UNIVERSE, 1995-1996 (ICPSR 2470). (3) The codebook, data collection instrument, and instruction manual are provided as two Portable Document Format (PDF) files. The PDF file format was developed by Adobe Systems Incorporated and can be accessed using the Adobe Acrobat Reader (version 3.0 or later). Information on how to obtain a copy of the Acrobat Reader is provided through the ICPSR Web site.
https://www.icpsr.umich.edu/web/ICPSR/studies/3025/terms
This survey is a component of the Robert Wood Johnson Foundation's Health Tracking Initiative, a program designed to monitor changes within the health care system and their effects on people. Focusing on care and treatment for alcohol, drug, and mental health conditions, the survey reinterviewed respondents to the 1996-1997 CTS Household Survey (COMMUNITY TRACKING STUDY HOUSEHOLD SURVEY, 1996-1997, AND FOLLOWBACK SURVEY, 1997-1998: [UNITED STATES] [ICPSR 2524]). Topics covered by the questionnaire include (1) demographics, (2) health and daily activities, (3) mental health, (4) alcohol and illicit drug use, (5) use of medications, (6) health insurance coverage including coverage for mental health, (7) access, utilization, and quality of behavioral health care, (8) work, income, and wealth, and (9) life difficulties. Five imputed versions of the data are included in the collection for analysis with multiple imputation techniques.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a Wilson interval for binomial proportions for use with multiple imputation for missing data. Using simulation studies, we show that it can have better repeated sampling properties than the usual confidence interval for binomial proportions based on Rubin’s combining rules. Further, in contrast to the usual multiple imputation confidence interval for proportions, the multiple imputation Wilson interval is always bounded by zero and one. Supplementary material is available online.
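For reference, the two ingredients the abstract combines are written out below; the paper's exact MI Wilson construction is not reproduced here.

```latex
% Rubin's combining rules for M completed-data estimates \hat{q}_m of the
% proportion, with completed-data variances \hat{u}_m:
\bar{q}_M = \frac{1}{M}\sum_{m=1}^{M}\hat{q}_m,
\qquad
T_M = \frac{1}{M}\sum_{m=1}^{M}\hat{u}_m
      + \left(1+\frac{1}{M}\right)\frac{1}{M-1}
        \sum_{m=1}^{M}\left(\hat{q}_m-\bar{q}_M\right)^2 .
% The usual MI interval \bar{q}_M \pm t_{\nu}\sqrt{T_M} can fall outside
% [0,1]. The complete-data Wilson interval for sample proportion \hat{p}
% from n trials, with normal critical value z, never does:
\frac{\hat{p}+\frac{z^2}{2n}}{1+\frac{z^2}{n}}
\;\pm\;
\frac{z}{1+\frac{z^2}{n}}
\sqrt{\frac{\hat{p}(1-\hat{p})}{n}+\frac{z^2}{4n^2}} .
```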
https://www.icpsr.umich.edu/web/ICPSR/studies/36677/terms
These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed. This study used the National Incident-Based Reporting System (NIBRS) to explore whether changes in the 2000-2010 decade were associated with changes in the prevalence and nature of violence between and among Whites, Blacks, and Hispanics. This study also aimed to construct more accessible NIBRS cross-sectional and longitudinal databases containing race/ethnic-specific measures of violent victimization, offending, and arrest. Researchers used NIBRS extract files to examine the influence of recent social changes on violence for Whites, Blacks, and Hispanics, and used advanced imputation techniques to account for missing values on race/ethnic variables. Data for this study was also drawn from the National Historical Geographic Information System, the Census Gazetteer, and Law Enforcement Officers Killed or Assaulted (LEOKA). The collection includes 1 Stata data file with 614 cases and 159 variables and 2 Stata syntax files.