This study was an evaluation of multiple imputation strategies to address missing data using the New Approach to Evaluating Supplementary Homicide Report (SHR) Data Imputation, 1990-1995 (ICPSR 20060) dataset.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Missing data is a growing concern in social science research. This paper introduces novel machine-learning methods for imputation and examines their efficiency and effect on analyses with missing data, using Internet and public service data as the test example. The empirical results confirm the robustness of the positive impact of Internet penetration on public services and show that the machine-learning imputation method outperformed random and multiple imputation, greatly improving the model's explanatory power. The panel data produced by machine-learning imputation show better continuity in the time trend, making them suitable for analysis, including with a dynamic panel model. The long-term effects of the Internet on public services were found to be significantly stronger than the short-term effects. Finally, some mechanisms underlying the empirical analysis are discussed.
The purpose of the project was to learn more about patterns of homicide in the United States by strengthening the ability to make imputations for Supplementary Homicide Report (SHR) data with missing values. Supplementary Homicide Reports (SHR) and local police data from Chicago, Illinois, St. Louis, Missouri, Philadelphia, Pennsylvania, and Phoenix, Arizona, for 1990 to 1995 were merged to create a master file by linking on overlapping information on victim and incident characteristics. Through this process, 96 percent of the cases in the SHR were matched with cases in the police files. The data contain variables for three types of cases: complete in SHR, missing offender and incident information in SHR but known in police report, and missing offender and incident information in both. The merged file allows estimation of similarities and differences between the cases with known offender characteristics in the SHR and those in the other two categories. The accuracy of existing data imputation methods can be assessed by comparing imputed values in an "incomplete" dataset (the SHR), generated by the three imputation strategies discussed in the literature, with the actual values in a known "complete" dataset (combined SHR and police data). Variables from both the Supplementary Homicide Reports and the additional police report offense data include incident date, victim characteristics, offender characteristics, incident details, geographic information, as well as variables regarding the matching procedure.
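The matching itself amounts to a record linkage on shared victim and incident fields. A minimal pandas sketch of that idea follows; the file and column names are hypothetical placeholders, not the study's actual linkage keys or procedure.

```python
import pandas as pd

# Hypothetical illustration of linking SHR records to local police files on
# overlapping victim/incident fields. File and column names are assumptions;
# the study's actual matching procedure is documented with the data.
keys = ["incident_date", "victim_age", "victim_sex", "victim_race", "weapon"]

shr = pd.read_csv("shr_1990_1995.csv")
police = pd.read_csv("police_files.csv")

merged = shr.merge(police, on=keys, how="left",
                   suffixes=("_shr", "_police"), indicator=True)
match_rate = (merged["_merge"] == "both").mean()
print(f"Matched {match_rate:.0%} of SHR cases")  # the study reports about 96%
```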
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Imputation of well log data is a common task in the field. However, a quick review of the literature reveals a lack of standardization in how methods for this problem are evaluated. The goal of this benchmark is to introduce a standard evaluation protocol for any imputation method for well log data.
In the proposed benchmark, three public datasets are used:
Here you can download all three datasets already preprocessed to be used with our implementation, found here.
There are six files for each fold partition of each dataset:
datasetname_fold_k_well_log_metadata_train.json: JSON file with general information on the slices of the training partition of fold k. Contains the total number of slices and the number of slices per well.
datasetname_fold_k_well_log_metadata_val.json: JSON file with general information on the slices of the validation partition of fold k. Contains the total number of slices and the number of slices per well.
datasetname_fold_k_well_log_slices_train.npy: .npy (numpy) file, ready to be loaded, with the already-processed training slices of fold k. When loaded it should have shape (total_slices, 256, number_of_logs).
datasetname_fold_k_well_log_slices_val.npy: .npy (numpy) file, ready to be loaded, with the already-processed validation slices of fold k.
datasetname_fold_k_well_log_slices_meta_train.json: JSON file with per-slice info for all slices in the training partition of fold k. For each slice, 7 data points are provided; the last four are discarded (they would contain other information that was not used). The first three are, in order, the origin well name, the starting position of the slice in that well, and the end position of the slice in that well.
datasetname_fold_k_well_log_slices_meta_val.json: JSON file with per-slice info for all slices in the validation partition of fold k.
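A minimal loading sketch for one fold is shown below; "datasetname" and the fold index are placeholders, and any JSON layout beyond what is described above is an assumption.

```python
import json
import numpy as np

# Load the preprocessed slices and metadata for one fold (names are placeholders).
fold = 0
slices = np.load(f"datasetname_fold_{fold}_well_log_slices_train.npy")
print(slices.shape)  # expected shape: (total_slices, 256, number_of_logs)

with open(f"datasetname_fold_{fold}_well_log_metadata_train.json") as fh:
    metadata = json.load(fh)  # total number of slices and slices per well

with open(f"datasetname_fold_{fold}_well_log_slices_meta_train.json") as fh:
    slice_meta = json.load(fh)  # per slice: well name, start, end, plus 4 unused fields
```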
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Multiple imputation (MI) has become a standard statistical technique for dealing with missing values. The CDC Anthrax Vaccine Research Program (AVRP) dataset created new challenges for MI due to the large number of variables of different types and the limited sample size. A common method for imputing missing data in such complex studies is to specify, for each of J variables with missing values, a univariate conditional distribution given all other variables, and then to draw imputations by iterating over the J conditional distributions. Such fully conditional imputation strategies have the theoretical drawback that the conditional distributions may be incompatible. When the missingness pattern is monotone, a theoretically valid approach is to specify, for each variable with missing values, a conditional distribution given the variables with fewer or the same number of missing values and sequentially draw from these distributions. In this article, we propose the “multiple imputation by ordered monotone blocks” approach, which combines these two basic approaches by decomposing any missingness pattern into a collection of smaller “constructed” monotone missingness patterns, and iterating. We apply this strategy to impute the missing data in the AVRP interim data. Supplemental materials, including all source code and a synthetic example dataset, are available online.
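As a small illustration of the monotone-pattern notion that the approach builds on, the sketch below (not the authors' code) orders variables by their amount of missingness and checks whether the resulting pattern is monotone.

```python
import numpy as np
import pandas as pd

def is_monotone(df: pd.DataFrame) -> bool:
    """Check whether a missingness pattern is monotone: with columns ordered
    from fewest to most missing values, any row missing column j must also be
    missing every later column."""
    order = df.isna().sum().sort_values().index
    miss = df[order].isna().to_numpy()
    # For booleans, miss[:, j] <= miss[:, j + 1] encodes "missing implies the next is missing".
    return bool(np.all(miss[:, :-1] <= miss[:, 1:]))

# Toy example: z is missing whenever y is, so the pattern is monotone.
example = pd.DataFrame({"x": [1, 2, 3], "y": [1, None, None], "z": [None, None, None]})
print(is_monotone(example))  # True
```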
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Missing data is an inevitable aspect of empirical research. Researchers have developed several techniques for handling missing data to avoid information loss and bias. Over the past 50 years, these methods have become more and more efficient, and also more complex. Building on previous review studies, this paper analyzes which missing data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016; JSTOR provided the data in text format. We applied a text-mining approach to extract the necessary information from our corpus. Our results show that the use of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examination period. At the same time, simpler methods, such as listwise and pairwise deletion, remain in widespread use.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Background: Genotype imputation is a critical preprocessing step in genome-wide association studies (GWAS), enhancing statistical power for detecting associated single nucleotide polymorphisms (SNPs) by increasing the number of markers. Results: In response to the needs of researchers seeking user-friendly graphical tools for imputation that do not require informatics or computing expertise, we have developed weIMPUTE, a web-based imputation graphical user interface (GUI). Unlike existing genotype imputation software, weIMPUTE supports multiple imputation tools, including SHAPEIT, Eagle, Minimac4, Beagle, and IMPUTE2, while encompassing the entire workflow, from quality control to data format conversion. This comprehensive platform enables both novices and experienced users to readily perform imputation tasks. For owners of reference genotype data, weIMPUTE can be installed on a server or workstation, enabling web-based imputation services without data sharing. Conclusion: weIMPUTE represents a versatile imputation solution for researchers across various fields, offering the flexibility to create personalized imputation servers on different operating systems.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
In real-world networks, node attributes are often only partially observed, necessitating imputation to support analysis or enable downstream tasks. However, most existing imputation methods overlook the rich information contained within the connectivity among nodes. This research is inspired by the premise that leveraging all available information should yield improved imputation, provided a sufficient association between attributes and edges. Consequently, we introduce a joint latent space model that produces a low-dimensional representation of the data and simultaneously captures the edge and node attribute information. This model relies on the pooling of information induced by shared latent variables, thus improving the prediction of node attributes and providing a more effective attribute imputation method. Our approach uses variational inference to approximate posterior distributions for these latent variables, resulting in predictive distributions for missing values. Through numerical experiments, conducted on both simulated data and real-world networks, we demonstrate that our proposed method successfully harnesses the joint structure information and significantly improves the imputation of missing attributes, specifically when the observed information is weak. Additional results, implementation details, a Python implementation, and the code reproducing the results are available online. Supplementary materials for this article are available online.
The findhap.f90 program finds haplotypes and imputes genotypes using multiple chip sets and sequence data. Program and download information can be found at the Animal Improvement Program (AIP) web site: http://aipl.arsusda.gov/software/findhap
Downloads: Version 4 program, example files, and executable (beta version; not quite ready for routine use on U.S. chip data, but performs better than version 3 for sequence data), and example data files for the imputation study presented by VanRaden and Sun at the 2014 World Congress on Genetics Applied to Livestock Production. The example files include actual pedigree, simulated true genotypes, simulated sequence reads, and imputed genotypes. This example used 500 reference bulls sequenced at 4× with 1% error and containing high-density SNPs; the 250 young bulls used to test imputation had only high-density SNPs. Other examples in the study can be generated by setting other options for the programs findhap4, geno2seq, and genosim.
Resources in this dataset: Resource Title: FINDHAP. File Name: Web Page, url: https://www.ars.usda.gov/research/software/download/?softwareid=494&modecode=80-42-05-30 (download page)
These data provide incident-level information on criminal homicides including location, circumstances, and method of offense, as well as demographic characteristics of victims and perpetrators and the relationship between the two. To adjust for unit missingness, a multiple imputation approach and a weighting scheme were adopted, resulting in a fully-imputed SHR cumulative database of criminal homicides for the years 1976-2005. Unlike other versions of the SHR files, these are limited to incidents of murder and non-negligent manslaughter, excluding justifiable homicides, negligent manslaughter and homicides related to the September 11, 2001, terrorist attacks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Data and R code used for the analysis of data for the publication: Coumoundouros et al., Cognitive behavioural therapy self-help intervention preferences among informal caregivers of adults with chronic kidney disease: an online cross-sectional survey. BMC Nephrology
Summary of study
An online cross-sectional survey of informal caregivers (e.g. family and friends) of people living with chronic kidney disease in the United Kingdom. The study aimed to examine informal caregivers' cognitive behavioural therapy self-help intervention preferences, and to describe the caregiving situation (e.g. types of care activities) and informal caregivers' mental health (depression, anxiety, and stress symptoms).
Participants were eligible to participate if they were at least 18 years old, lived in the United Kingdom, and provided unpaid care to someone living with chronic kidney disease who was at least 18 years old.
The online survey included questions regarding (1) informal caregivers' characteristics; (2) care recipients' characteristics; (3) intervention preferences (e.g. content, delivery format); and (4) informal caregivers' mental health. Informal caregivers' mental health was assessed using the 21-item Depression, Anxiety, and Stress Scale (DASS-21), which is composed of three subscales measuring depression, anxiety, and stress, respectively.
Sixty-five individuals participated in the survey.
See the published article for full study details.
Description of uploaded files
ENTWINE_ESR14_Kidney Carer Survey Data_FULL_2022-08-30: Excel file with the complete, raw survey data. Note: the first half of participants' postal codes was collected; however, these data were removed from the uploaded dataset to ensure participant anonymity.
ENTWINE_ESR14_Kidney Carer Survey Data_Clean DASS-21 Data_2022-08-30: Excel file with cleaned data for the DASS-21 scale. Data cleaning involved imputing missing data when a participant was missing one item within a subscale of the DASS-21; the missing value was imputed as the mean of all other items within the relevant subscale (see the sketch after this file list).
ENTWINE_ESR14_Kidney Carer Survey_KEY_2022-08-30: Excel file with key linking item labels in uploaded datasets with the corresponding survey question.
R Code for Kidney Carer Survey_2022-08-30: R file of R code used to analyse survey data.
R code for Kidney Carer Survey_PDF_2022-08-30: PDF file of R code used to analyse survey data.
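A minimal sketch of the DASS-21 cleaning rule described above is given here; the item column names are hypothetical placeholders, not the labels used in the uploaded files, and the actual cleaning was done in the authors' R scripts.

```python
import pandas as pd

# Hypothetical DASS-21 item columns (7 items per subscale); replace with the
# real labels from the survey key file.
SUBSCALES = {
    "depression": [f"dep_{i}" for i in range(1, 8)],
    "anxiety": [f"anx_{i}" for i in range(1, 8)],
    "stress": [f"str_{i}" for i in range(1, 8)],
}

def impute_dass21(df: pd.DataFrame) -> pd.DataFrame:
    """Impute a single missing item per subscale with the mean of the other items."""
    df = df.copy()
    for items in SUBSCALES.values():
        block = df[items]
        one_missing = block.isna().sum(axis=1) == 1   # rule applies only to one missing item
        means = block.mean(axis=1, skipna=True)       # mean of the observed items
        for col in items:
            fill = one_missing & block[col].isna()
            df.loc[fill, col] = means[fill]
    return df
```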
https://www.gesis.org/en/institute/data-usage-terms
Since the early stages of public opinion research, nonresponse has been identified as an important threat to the degree to which a sample can represent the population of interest. Researchers have documented a trend of declining response rates over the years. However, the nonresponse rate becomes a concern only when it introduces error or bias into survey results. One way to estimate nonresponse bias is through imputation. Online panels, which maintain a pool of respondents who are invited to participate in research through electronic means, face unique opportunities as well as challenges with regard to nonresponse and its imputation. Using data from a nationwide online panel, this paper hypothesizes that nonresponse bias may exist because of common causes shared between response propensity and opinion placements. After testing for these common causes, imputations are made to estimate the missing values. Lastly, the observed distributions on variables of interest are compared with the imputed distributions to show the scope of nonresponse bias. This paper finds that nonresponse biases may exist in online panels. First, the theoretical model of nonresponse bias was supported because the common-cause pattern was found in the dataset. In other words, response propensity and the opinion items of interest appeared to share common causes, mostly demographic variables. Second, the imputation analyses show that although most of the differences between imputed and measured opinions do not indicate serious biases, there were a few cases in which the differences appeared critical. The limitations of this study, especially those of the imputation method, are discussed at the end of the chapter, along with suggestions for future research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A sequential regression or chained equations imputation approach uses a Gibbs sampling-type iterative algorithm that imputes the missing values using a sequence of conditional regression models. It is a flexible approach for handling different types of variables and complex data structures. Many simulation studies have shown that the multiple imputation inferences based on this procedure have desirable repeated sampling properties. However, a theoretical weakness of this approach is that the specification of a set of conditional regression models may not be compatible with any joint distribution of the variables being imputed. Hence, the convergence properties of the iterative algorithm are not well understood. This article develops conditions for convergence and assesses the properties of inferences from both compatible and incompatible sequences of regression models. The results are established for the missing data pattern where each subject may be missing a value on at most one variable. The sequence of regression models is assumed to be an empirically good fit for the data, chosen by the imputer based on appropriate model diagnostics. The results are used to develop criteria for the choice of regression models. Supplementary materials for this article are available online.
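For readers unfamiliar with the chained-equations idea discussed here, the sketch below uses scikit-learn's off-the-shelf IterativeImputer as a generic illustration of iterating over conditional regression models; it is not the article's procedure or its diagnostics.

```python
import numpy as np
# IterativeImputer is an experimental, generic chained-equations implementation.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # introduce roughly 20% missing values

# Each variable with missing values is regressed on the others in turn, and the
# cycle is iterated (Gibbs-sampling style); sample_posterior draws from the
# predictive distribution rather than using the conditional mean.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_completed = imputer.fit_transform(X)
```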
https://www.icpsr.umich.edu/web/ICPSR/studies/24801/terms
These data provide incident-level information on criminal homicides including location, circumstances, and method of offense, as well as demographic characteristics of victims and perpetrators and the relationship between the two. To adjust for unit missingness, a multiple imputation approach and a weighting scheme were adopted, resulting in a fully-imputed SHR cumulative database of criminal homicides for the years 1976-2007. Unlike other versions of the SHR files, these are limited to incidents of murder and non-negligent manslaughter, excluding justifiable homicides, negligent manslaughter and homicides related to the September 11, 2001, terrorist attacks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This archive contains supplementary materials for the published manuscript.
We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total flow-employment, beginning-of-quarter employment, full-quarter employment, average monthly earnings of full-quarter employees, and total quarterly payroll. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM), including OnTheMap for Emergency Management. We account for errors due to coverage; record-level non-response; edit and imputation of item missing data; and statistical disclosure limitation. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs are a transition zone, where cells may be fit for use with caution. Tabulations involving one or two jobs, which are generally suppressed on fitness-for-use criteria in the QWI and synthesized in LODES, have substantial total variability but can still be used to estimate statistics for untabulated aggregates as long as the job count in the aggregate is more than 10.
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de434779
Abstract (en): The primary purpose of the State Nonfiscal Survey is to provide basic information on public elementary and secondary school students and staff for each of the 50 states, the District of Columbia, and outlying territories (American Samoa, Guam, Puerto Rico, the Virgin Islands, and the Marshall Islands). The database provides the following information on students and staff: general information (name, address, and telephone number of the state education agency), staffing information (number of FTEs on the instructional staff, guidance counselor staff, library staff, support staff, and administrative staff), and student information (membership counts by grade, counts of high school completers, counts of high school completers by racial/ethnic breakouts, and breakouts for dropouts by grade, sex, and race).
ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing step for this data collection: checked for undocumented or out-of-range codes.
Universe: All public elementary and secondary education agencies in the 50 states, the District of Columbia, United States territories (American Samoa, Guam, Puerto Rico, the Virgin Islands, and the Marshall Islands), and Department of Defense schools outside of the United States.
2006-01-18: File DOC2450.ALL.PDF was removed from any previous datasets and flagged as a study-level file, so that it will accompany all downloads. 2006-01-18: File CB2450.ALL.PDF was removed from any previous datasets and flagged as a study-level file, so that it will accompany all downloads.
(1) Part 2, Imputed Data, is a different version of the data in Part 1, Reported Data. The National Center for Education Statistics (NCES) imputed and adjusted some reported values in order to create a data file (Part 2) that more accurately reflects student and staff counts and improves comparability between states. Imputations are defined as cases where the missing value is not reported at all, indicating that subtotals for the category are under-reported. An imputation by NCES assigns a value to the missing item, and the subtotals containing this item increase by the amount of the imputation. Imputations and adjustments were performed on the 50 states and Washington, DC, only. Since all states and Washington, DC, reported data in this survey, these imputations and adjustments were implemented to correct for item nonresponse only. This process consisted of several stages and steps, and varied as to the nature of the missing data. No adjustments or imputations were made to high school graduates or other high school completer categories, nor were any adjustments or imputations performed on the race/ethnicity data.
(2) The Instruction Manual that is included with this data collection also applies to COMMON CORE OF DATA: PUBLIC EDUCATION AGENCY UNIVERSE, 1995-1996 (ICPSR 2468) and COMMON CORE OF DATA: PUBLIC SCHOOL UNIVERSE, 1995-1996 (ICPSR 2470).
(3) The codebook, data collection instrument, and instruction manual are provided as two Portable Document Format (PDF) files. The PDF file format was developed by Adobe Systems Incorporated and can be accessed using the Adobe Acrobat Reader (version 3.0 or later).
Information on how to obtain a copy of the Acrobat Reader is provided through the ICPSR Website on the Internet.
These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed. This study used the National Incident-Based Reporting System (NIBRS) to explore whether changes in the 2000-2010 decade were associated with changes in the prevalence and nature of violence between and among Whites, Blacks, and Hispanics. This study also aimed to construct more accessible NIBRS cross-sectional and longitudinal databases containing race/ethnic-specific measures of violent victimization, offending, and arrest. Researchers used NIBRS extract files to examine the influence of recent social changes on violence for Whites, Blacks, and Hispanics, and used advanced imputation techniques to account for missing values on race/ethnic variables. Data for this study was also drawn from the National Historical Geographic Information System, the Census Gazetteer, and Law Enforcement Officers Killed or Assaulted (LEOKA). The collection includes 1 Stata data file with 614 cases and 159 variables and 2 Stata syntax files.
harmonized_imputed_gwas.tar contains 114 publicly available GWAS traits, harmonized and imputed to the GTEx v8 reference.
gwas_metadata.txt is a table with useful information about each trait, such as:
- Tag: trait name (also in the file name)
- PUBMED_Paper_Link: PUBMED or publication URL (if available)
- Portal: URL of the web portal from which the data were downloaded
- Consortium: GWAS consortium authoring the data
- Sample_Size: number of individuals covered in the study
- Population: individuals' ancestry (EUR, EAS, etc.)
- abbreviation: short name used for figures
- new_abbreviation: alternative name for additional figures
- Deflation: whether imputed summary statistics exhibited deflation (i.e., association p-values lower than expected by chance; the summary statistics imputation method is conservative, and in public GWAS with few observed variants (<2M), the distribution of p-values lags towards lower significance)
Data usage policy: When using this data, you must acknowledge the source by citing the publication "Widespread dose-dependent effects of RNA expression and splicing on complex diseases and traits" (https://doi.org/10.1101/814350).
Disclaimer: The data is provided "as is", and the authors assume no responsibility for errors or omissions. The User assumes the entire risk associated with its use of these data. The authors shall not be held liable for any use or misuse of the data described and/or contained herein. The User bears all responsibility in determining whether these data are fit for the User's intended use. The information contained in these data is not better than the original sources from which they were derived, and both scale and accuracy may vary across the data set. These data may not have the accuracy, resolution, completeness, timeliness, or other characteristics appropriate for applications that potential users of the data may contemplate. The user is responsible for complying with any data usage policy from the original GWAS studies; refer to the list of traits described here to identify their respective consortia's requirements. THE DATA IS PROVIDED WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATA OR THE USE OR OTHER DEALINGS IN THE DATA.
https://www.icpsr.umich.edu/web/ICPSR/studies/3025/terms
This survey is a component of the Robert Wood Johnson Foundation's Health Tracking Initiative, a program designed to monitor changes within the health care system and their effects on people. Focusing on care and treatment for alcohol, drug, and mental health conditions, the survey reinterviewed respondents to the 1996-1997 CTS Household Survey (COMMUNITY TRACKING STUDY HOUSEHOLD SURVEY, 1996-1997, AND FOLLOWBACK SURVEY, 1997-1998: [UNITED STATES] [ICPSR 2524]). Topics covered by the questionnaire include (1) demographics, (2) health and daily activities, (3) mental health, (4) alcohol and illicit drug use, (5) use of medications, (6) health insurance coverage including coverage for mental health, (7) access, utilization, and quality of behavioral health care, (8) work, income, and wealth, and (9) life difficulties. Five imputed versions of the data are included in the collection for analysis with multiple imputation techniques.
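Since the collection ships five imputed versions of the data, analyses are typically run on each version and then pooled. The sketch below shows standard Rubin's-rules pooling with placeholder estimates, not results from these data.

```python
import numpy as np

def pool_rubin(estimates, std_errors):
    """Pool a scalar estimate across m imputed data sets with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(std_errors, dtype=float) ** 2
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = w_bar + (1 + 1 / m) * b         # total variance
    return q_bar, np.sqrt(t)

# Placeholder numbers for illustration only (one estimate/SE per imputed data set).
est, se = pool_rubin([1.02, 0.97, 1.05, 0.99, 1.01], [0.11, 0.12, 0.10, 0.11, 0.12])
```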
Censuses are the principal means of collecting basic population and housing statistics required for social and economic development, policy interventions, and their implementation and evaluation. The census plays an essential role in public administration. The results are used to ensure:
• equity in distribution of government services
• distributing and allocating government funds among various regions and districts for education and health services
• delineating electoral districts at national and local levels, and
• measuring the impact of industrial development, to name a few
The census also provides the benchmark for all surveys conducted by the national statistical office. Without the sampling frame derived from the census, the national statistical system would face difficulties in providing reliable official statistics for use by government and the public. The census also provides information on small areas and population groups with minimum sampling errors. This is important, for example, in planning the location of a school or clinic. Census information is also invaluable for use in the private sector for activities such as business planning and market analyses. The information is used as a benchmark in research and analysis.
Census 2011 was the third democratic census to be conducted in South Africa. Census 2011 specific objectives included:
- To provide statistics on population, demographic, social, economic and housing characteristics;
- To provide a base for the selection of a new sampling frame;
- To provide data at the lowest geographical level; and
- To provide a primary base for the mid-year projections.
National
Households, Individuals
Census/enumeration data [cen]
Face-to-face [f2f]
About the Questionnaire:
Much emphasis has been placed on the need for a population census to help government direct its development programmes, but less has been written about how the census questionnaire is compiled. The main focus of a population and housing census is to take stock and produce a total count of the population without omission or duplication. Another major focus is to be able to provide accurate demographic and socio-economic characteristics pertaining to each individual enumerated. Apart from individuals, the focus is on collecting accurate data on housing characteristics and services. A population and housing census provides data needed to facilitate informed decision-making as far as policy formulation and implementation are concerned, as well as to monitor and evaluate their programmes at the smallest area level possible. It is therefore important that Statistics South Africa collects statistical data that comply with the United Nations recommendations and other relevant stakeholder needs.
The United Nations underscores the following factors in determining the selection of topics to be investigated in population censuses:
a) The needs of a broad range of data users in the country;
b) Achievement of the maximum degree of international comparability, both within regions and on a worldwide basis;
c) The probable willingness and ability of the public to give adequate information on the topics; and
d) The total national resources available for conducting a census.
In addition, the UN stipulates that census-takers should avoid collecting information that is no longer required simply because it was traditionally collected in the past, but rather focus on key demographic, social and socio-economic variables. It becomes necessary, therefore, in consultation with a broad range of users of census data, to review periodically the topics traditionally investigated and to re-evaluate the need for the series to which they contribute, particularly in the light of new data needs and alternative data sources that may have become available for investigating topics formerly covered in the population census. It was against this background that Statistics South Africa conducted user consultations in 2008 after the release of some of the Community Survey products. However, some groundwork in relation to core questions recommended by all countries in Africa has been done. In line with users' meetings, the crucial demands of the Millennium Development Goals (MDGs) should also be met. It is also imperative that Stats SA meet the demands of the users that require small area data.
Accuracy of data depends on a well-designed questionnaire that is short and to the point. The interview to complete the questionnaire should not take longer than 18 minutes per household. Accuracy also depends on the diligence of the enumerator and honesty of the respondent. On the other hand, disadvantaged populations, owing to their small numbers, are best covered in the census and not in household sample surveys. Variables such as employment/unemployment, religion, income, and language are more accurately covered in household surveys than in censuses. Users'/stakeholders' input in terms of providing information in the planning phase of the census is crucial in making it a success. However, the information provided should be within the scope of the census.
Individual particulars:
Section A: Demographics
Section B: Migration
Section C: General Health and Functioning
Section D: Parental Survival and Income
Section E: Education
Section F: Employment
Section G: Fertility (Women 12-50 Years Listed)
Section H: Housing, Household Goods and Services and Agricultural Activities
Section I: Mortality in the Last 12 Months
The Household Questionnaire is available in Afrikaans; English; isiZulu; IsiNdebele; Sepedi; SeSotho; SiSwati; Tshivenda; Xitsonga.
The Transient and Tourist Hotel Questionnaire (English) is divided into the following sections:
Name, Age, Gender, Date of Birth, Marital Status, Population Group, Country of birth, Citizenship, Province.
The Questionnaire for Institutions (English) is divided into the following sections:
Particulars of the institution
Availability of piped water for the institution
Main source of water for domestic use
Main type of toilet facility
Type of energy/fuel used for cooking, heating and lighting at the institution
Disposal of refuse or rubbish
Asset ownership (TV, Radio, Landline telephone, Refrigerator, Internet facilities)
List of persons in the institution on census night (name, date of birth, sex, population group, marital status, barcode number)
The Post Enumeration Survey Questionnaire (English)
These questionnaires are provided as external resources.
Data editing and validation system
The execution of each phase of Census operations introduces some form of error into Census data. Despite the quality assurance methodologies embedded in all the phases (data collection, data capturing (both manual and automated), coding, and editing), a number of errors creep in and distort the collected information. To promote consistency and improve data quality, editing is a paramount phase for identifying and minimising errors such as invalid values, inconsistent entries or unknown/missing values. The editing process for Census 2011 was based on defined rules (specifications).
The editing of Census 2011 data involved a number of sequential processes: selection of members of the editing team, review of Census 2001 and 2007 Community Survey editing specifications, development of editing specifications for the Census 2011 pre-tests (2009 pilot and 2010 Dress Rehearsal), development of firewall editing specifications and finalisation of specifications for the main Census.
Editing team
The Census 2011 editing team was drawn from various divisions of the organisation based on skills and experience in data editing. The team thus comprised subject matter specialists (demographers and programmers), managers, as well as data processors.
The Census 2011 questionnaire was very complex, characterised by many sections, interlinked questions and skipping instructions. Editing of such complex, interlinked data items required the application of a combination of editing techniques. Errors relating to structure were resolved using Structured Query Language (SQL) in the Oracle dataset, while CSPro software was used to resolve content-related errors. The strategy used for Census 2011 data editing was the implementation of automated error detection and correction with minimal changes. Combinations of logical and dynamic imputation/editing were used. Logical imputations were preferred, and in many cases substantial effort was undertaken to deduce a consistent value based on the rest of the household's information. To profile the extent of changes in the dataset and assess the effects of imputation, a set of imputation flags is included in the edited dataset. Imputation flag values are as follows:
0 - no imputation was performed; raw data were preserved
1 - logical editing was performed, raw data were blank
2 - logical editing was performed, raw data were not blank
3 - hot-deck imputation was performed, raw data were blank
4 - hot-deck imputation was performed, raw data were not blank
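As a generic illustration of the hot-deck step mentioned above (a simplified assumption, not Stats SA's production editing system), the sketch below fills each missing value with a randomly drawn observed value from donor records in the same group.

```python
import numpy as np
import pandas as pd

def hot_deck(df: pd.DataFrame, column: str, group: str, seed: int = 0) -> pd.DataFrame:
    """Group-based hot-deck imputation: missing values in `column` are replaced
    by values drawn at random from observed donors within the same `group`."""
    rng = np.random.default_rng(seed)
    out = df.copy()

    def fill(block: pd.Series) -> pd.Series:
        donors = block.dropna().to_numpy()
        if donors.size == 0:
            return block  # no donor in this group; leave values missing
        block = block.copy()
        missing = block.isna()
        block[missing] = rng.choice(donors, size=missing.sum())
        return block

    out[column] = out.groupby(group)[column].transform(fill)
    return out
```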
Independent monitoring and evaluation of Census field activities
Independent monitoring of the Census 2011 field activities was carried out by a team of 31 professionals and 381 Monitoring