99 datasets found
  1. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Available download formats: pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
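
    As a rough illustration of two of the strategies named above (deletion and mean imputation), the sketch below uses made-up scores with None marking a missing value. It is not taken from the guide itself; the function names and data are assumptions chosen for the example.

```python
from statistics import mean


def listwise_delete(values):
    """Deletion: keep only the observed (non-missing) values."""
    return [v for v in values if v is not None]


def mean_impute(values):
    """Mean imputation: replace each missing value with the observed mean."""
    fill = mean(listwise_delete(values))
    return [fill if v is None else v for v in values]


# Hypothetical scores; None marks a missing response.
scores = [4.0, None, 6.0, 8.0, None]
deleted = listwise_delete(scores)
imputed = mean_impute(scores)
```

    Mean imputation keeps the sample mean unchanged but shrinks the variance, which is one reason regression-based and stochastic methods are often preferred.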

  2. A dataset from a survey investigating disciplinary differences in data...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin, csv, pdf, txt
    Updated Jul 12, 2024
    Cite
    Anton Boudreau Ninkov; Chantal Ripp; Kathleen Gregory; Isabella Peters; Stefanie Haustein (2024). A dataset from a survey investigating disciplinary differences in data citation [Dataset]. http://doi.org/10.5281/zenodo.7555363
    Available download formats: csv, txt, pdf, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anton Boudreau Ninkov; Chantal Ripp; Kathleen Gregory; Isabella Peters; Stefanie Haustein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GENERAL INFORMATION

    Title of Dataset: A dataset from a survey investigating disciplinary differences in data citation

    Date of data collection: January to March 2022

    Collection instrument: SurveyMonkey

    Funding: Alfred P. Sloan Foundation


    SHARING/ACCESS INFORMATION

    Licenses/restrictions placed on the data: These data are available under a CC BY 4.0 license

    Links to publications that cite or use the data:

    Gregory, K., Ninkov, A., Ripp, C., Peters, I., & Haustein, S. (2022). Surveying practices of data citation and reuse across disciplines. Proceedings of the 26th International Conference on Science and Technology Indicators. International Conference on Science and Technology Indicators, Granada, Spain. https://doi.org/10.5281/ZENODO.6951437

    Gregory, K., Ninkov, A., Ripp, C., Roblin, E., Peters, I., & Haustein, S. (2023). Tracing data:
    A survey investigating disciplinary differences in data citation.
    Zenodo. https://doi.org/10.5281/zenodo.7555266


    DATA & FILE OVERVIEW

    File List

    • Filename: MDCDatacitationReuse2021Codebook.pdf
      Codebook
    • Filename: MDCDataCitationReuse2021surveydata.csv
      Dataset format in csv
    • Filename: MDCDataCitationReuse2021surveydata.sav
      Dataset format in SPSS
    • Filename: MDCDataCitationReuseSurvey2021QNR.pdf
      Questionnaire

    Additional related data collected that was not included in the current data package: Open ended questions asked to respondents


    METHODOLOGICAL INFORMATION

    Description of methods used for collection/generation of data:

    The development of the questionnaire (Gregory et al., 2022) was centered around the creation of two main branches of questions for the primary groups of interest in our study: researchers that reuse data (33 questions in total) and researchers that do not reuse data (16 questions in total). The population of interest for this survey consists of researchers from all disciplines and countries, sampled from the corresponding authors of papers indexed in the Web of Science (WoS) between 2016 and 2020.

    The survey received 3,632 responses, 2,509 of which were completed, representing a completion rate of 68.6%. Incomplete responses were excluded from the dataset. The final total contains 2,492 complete responses and an uncorrected response rate of 1.57%. Controlling for invalid emails, bounced emails and opt-outs (n=5,201) produced a response rate of 1.62%, similar to surveys using comparable recruitment methods (Gregory et al., 2020).

    Methods for processing the data:

    Results were downloaded from SurveyMonkey in CSV format and were prepared for analysis using Excel and SPSS by recoding ordinal and multiple choice questions and by removing missing values.

    Instrument- or software-specific information needed to interpret the data:

    The dataset is provided in SPSS format, which requires IBM SPSS Statistics. The dataset is also available in a coded format in CSV. The Codebook is required to interpret the values.


    DATA-SPECIFIC INFORMATION FOR: MDCDataCitationReuse2021surveydata

    Number of variables: 94

    Number of cases/rows: 2,492

    Missing data codes: 999 Not asked

    Refer to MDCDatacitationReuse2021Codebook.pdf for detailed variable information.
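
    When reading the coded CSV, the 999 "Not asked" code should be treated as missing rather than as a real answer. The sketch below shows one way to do that with the standard library. The column names (resp_id, q_reuse) and the sample rows are hypothetical; the real variable names are listed in the codebook.

```python
import csv
import io

NOT_ASKED = "999"  # missing-data code given in the dataset description


def recode_not_asked(rows, columns):
    """Replace the 999 'Not asked' code with None in the named columns only."""
    return [
        {
            key: (None if key in columns and value == NOT_ASKED else value)
            for key, value in row.items()
        }
        for row in rows
    ]


# Hypothetical extract standing in for MDCDataCitationReuse2021surveydata.csv.
sample = io.StringIO("resp_id,q_reuse\n1,2\n2,999\n999,1\n")
rows = recode_not_asked(list(csv.DictReader(sample)), columns={"q_reuse"})
not_asked_count = sum(row["q_reuse"] is None for row in rows)
```

    Restricting the recode to named columns avoids wiping out a legitimate 999 in an identifier or count column.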

  3. Sensitivity Analysis Tools for Clinical Trials with Missing Data [Methods...

    • icpsr.umich.edu
    Updated Sep 15, 2025
    Cite
    Scharfstein, Daniel O. (2025). Sensitivity Analysis Tools for Clinical Trials with Missing Data [Methods Study], 2013-2018 [Dataset]. http://doi.org/10.3886/ICPSR39492.v1
    Dataset updated
    Sep 15, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Scharfstein, Daniel O.
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/39492/terms

    Time period covered
    2013 - 2018
    Area covered
    United States
    Description

    Clinical trials study the effects of medical treatments, like how safe they are and how well they work. But most clinical trials don't get all the data they need from patients. Patients may not answer all questions on a survey, or they may drop out of a study after it has started. The missing data can affect researchers' ability to detect the effects of treatments. To address the problem of missing data, researchers can make different guesses based on why and how data are missing. Then they can look at results for each guess. If results based on different guesses are similar, researchers can have more confidence that the study results are accurate. In this study, the research team created new methods to do these tests and developed software that runs these tests. To access the sensitivity analysis methods and software, please visit the MissingDataMatters website.

  4. Percentage (%) and number (n) of missing values in the outcome (maximum grip...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t003
    Available download formats: xls
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data.

  5. Datasheet3_Assessing disparities through missing race and ethnicity data:...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jul 24, 2024
    + more versions
    Cite
    Katelyn M. Banschbach; Jade Singleton; Xing Wang; Sheetal S. Vora; Julia G. Harris; Ashley Lytch; Nancy Pan; Julia Klauss; Danielle Fair; Erin Hammelev; Mileka Gilbert; Connor Kreese; Ashley Machado; Peter Tarczy-Hornoch; Esi M. Morgan (2024). Datasheet3_Assessing disparities through missing race and ethnicity data: results from a juvenile arthritis registry.pdf [Dataset]. http://doi.org/10.3389/fped.2024.1430981.s003
    Available download formats: pdf
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Katelyn M. Banschbach; Jade Singleton; Xing Wang; Sheetal S. Vora; Julia G. Harris; Ashley Lytch; Nancy Pan; Julia Klauss; Danielle Fair; Erin Hammelev; Mileka Gilbert; Connor Kreese; Ashley Machado; Peter Tarczy-Hornoch; Esi M. Morgan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    Ensuring high-quality race and ethnicity data within the electronic health record (EHR) and across linked systems, such as patient registries, is necessary to achieving the goal of inclusion of racial and ethnic minorities in scientific research and detecting disparities associated with race and ethnicity. The project goal was to improve race and ethnicity data completion within the Pediatric Rheumatology Care Outcomes Improvement Network and assess impact of improved data completion on conclusions drawn from the registry.

    Methods

    This is a mixed-methods quality improvement study that consisted of five parts, as follows: (1) Identifying baseline missing race and ethnicity data, (2) Surveying current collection and entry, (3) Completing data through audit and feedback cycles, (4) Assessing the impact on outcome measures, and (5) Conducting participant interviews and thematic analysis.

    Results

    Across six participating centers, 29% of the patients were missing data on race and 31% were missing data on ethnicity. Of patients missing data, most patients were missing both race and ethnicity. Rates of missingness varied by data entry method (electronic vs. manual). Recovered data had a higher percentage of patients with Other race or Hispanic/Latino ethnicity compared with patients with non-missing race and ethnicity data at baseline. Black patients had a significantly higher odds ratio of having a clinical juvenile arthritis disease activity score (cJADAS10) of ≥5 at first follow-up compared with White patients. There was no significant change in odds ratio of cJADAS10 ≥5 for race and ethnicity after data completion. Patients missing race and ethnicity were more likely to be missing cJADAS values, which may affect the ability to detect changes in odds ratio of cJADAS ≥5 after completion.

    Conclusions

    About one-third of the patients in a pediatric rheumatology registry were missing race and ethnicity data. After three audit and feedback cycles, centers decreased missing data by 94%, primarily via data recovery from the EHR. In this sample, completion of missing data did not change the findings related to differential outcomes by race. Recovered data were not uniformly distributed compared with those with non-missing race and ethnicity data at baseline, suggesting that differences in outcomes after completing race and ethnicity data may be seen with larger sample sizes.

  6. Base rates of food safety practices in European households: Summary data...

    • data.niaid.nih.gov
    Updated Nov 4, 2022
    Cite
    Scholderer, Joachim (2022). Base rates of food safety practices in European households: Summary data from the SafeConsume Household Survey [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7264924
    Dataset updated
    Nov 4, 2022
    Dataset provided by
    University of Zurich
    Authors
    Scholderer, Joachim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains estimates of the base rates of 550 food safety-relevant food handling practices in European households. The data are representative for the population of private households in the ten European countries in which the SafeConsume Household Survey was conducted (Denmark, France, Germany, Greece, Hungary, Norway, Portugal, Romania, Spain, UK).

    Sampling design

    In each of the ten EU and EEA countries where the survey was conducted (Denmark, France, Germany, Greece, Hungary, Norway, Portugal, Romania, Spain, UK), the population under study was defined as the private households in the country. Sampling was based on a stratified random design, with the NUTS2 statistical regions of Europe and the education level of the target respondent as stratum variables. The target sample size was 1000 households per country, with selection probability within each country proportional to stratum size.

    Fieldwork

    The fieldwork was conducted between December 2018 and April 2019 in ten EU and EEA countries (Denmark, France, Germany, Greece, Hungary, Norway, Portugal, Romania, Spain, United Kingdom). The target respondent in each household was the person with main or shared responsibility for food shopping in the household. The fieldwork was sub-contracted to a professional research provider (Dynata, formerly Research Now SSI). Complete responses were obtained from altogether 9996 households.

    Weights

    In addition to the SafeConsume Household Survey data, population data from Eurostat (2019) were used to calculate weights. These were calculated with NUTS2 region as the stratification variable and assigned an influence to each observation in each stratum that was proportional to how many households in the population stratum a household in the sample stratum represented. The weights were used in the estimation of all base rates included in the data set.
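
    The weighting idea described above can be sketched as follows: each sampled household gets a weight equal to the number of population households it represents in its stratum, and a base rate is the weighted mean of the [0,1] responses. The stratum labels and counts are invented for illustration, and this is only a sketch of the general technique, not the authors' actual computation.

```python
def stratum_weights(population_counts, sample_counts):
    """Weight per sampled household = population households / sampled households."""
    return {
        stratum: population_counts[stratum] / sample_counts[stratum]
        for stratum in sample_counts
    }


def weighted_base_rate(observations, weights):
    """Weighted mean of values in [0, 1]; observations is a list of (stratum, value)."""
    numerator = sum(weights[stratum] * value for stratum, value in observations)
    denominator = sum(weights[stratum] for stratum, _ in observations)
    return numerator / denominator


# Hypothetical strata: region "B" is under-sampled relative to its population.
weights = stratum_weights({"A": 1000, "B": 3000}, {"A": 2, "B": 2})
observations = [("A", 1), ("A", 0), ("B", 1), ("B", 1)]
rate = weighted_base_rate(observations, weights)
```

    In this toy example the unweighted rate is 0.75, while the weighted rate is 0.875 because the under-sampled stratum counts for more.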

    Transformations

    All survey variables were normalised to the [0,1] range before the analysis. Responses to food frequency questions were transformed into the proportion of all meals consumed during a year where the meal contained the respective food item. Responses to questions with 11-point Juster probability scales as the response format were transformed into numerical probabilities. Responses to questions with time (hours, days, weeks) or temperature (C) as response formats were discretised using supervised binning. The thresholds best separating between the bins were chosen on the basis of five-fold cross-validated decision trees. The binned versions of these variables, and all other input variables with multiple categorical response options (either with a check-all-that-apply or forced-choice response format) were transformed into sets of binary features, with a value 1 assigned if the respective response option had been checked, 0 otherwise.
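
    The last transformation step, turning a multiple-option response into a set of binary features, can be sketched like this. The option names are hypothetical and the function is an illustration of the described encoding rather than the code used to build the data set.

```python
def to_binary_features(checked, options):
    """Return 1 for each response option that was checked, 0 otherwise."""
    return {option: int(option in checked) for option in options}


# Hypothetical check-all-that-apply question with three options.
options = ["fridge", "freezer", "counter"]
features = to_binary_features({"fridge", "counter"}, options)
```

    A forced-choice question is the special case in which exactly one feature in the set equals 1.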

    Treatment of missing values

    In many cases, a missing value on a feature logically implies that the respective data point should have a value of zero. If, for example, a participant in the SafeConsume Household Survey had indicated that a particular food was not consumed in their household, the participant was not presented with any other questions related to that food, which automatically results in missing values on all features representing the responses to the skipped questions. However, zero consumption would also imply a zero probability that the respective food is consumed undercooked. In such cases, missing values were replaced with a value of 0.
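
    A minimal sketch of this rule: when a gating feature shows that a food is not consumed, the skipped follow-up features are filled with 0, while missing values in other records are left alone. The feature names are hypothetical and None stands for a missing value.

```python
def fill_structural_zeros(record, gate, dependents):
    """If the gate feature is 0, set missing dependent features to 0."""
    filled = dict(record)
    if filled.get(gate) == 0:
        for name in dependents:
            if filled.get(name) is None:
                filled[name] = 0
    return filled


# Hypothetical features: the household never eats chicken, so the
# follow-up question on undercooked chicken was skipped.
skipped = {"eats_chicken": 0, "chicken_undercooked": None}
unanswered = {"eats_chicken": 1, "chicken_undercooked": None}
filled_skipped = fill_structural_zeros(skipped, "eats_chicken", ["chicken_undercooked"])
filled_unanswered = fill_structural_zeros(unanswered, "eats_chicken", ["chicken_undercooked"])
```

    The second record stays missing because the household does eat the food, so a zero is not logically implied.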

  7. 🦠 COVID-19 survey of National Statistical Offices

    • kaggle.com
    zip
    Updated Sep 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    meer atif magsi (2023). 🦠 COVID-19 survey of National Statistical Offices [Dataset]. https://www.kaggle.com/datasets/meeratif/covid-19-survey-of-national-statistical-offices
    Available download formats: zip (22535 bytes)
    Dataset updated
    Sep 10, 2023
    Authors
    meer atif magsi
    Description

    Global COVID-19 surveys conducted by National Statistical Offices. This dataset has several columns that contain different types of information. Here's a brief explanation of each column:

    1. Country: This column likely contains the names of the countries for which the survey data is collected. Each row represents data related to a specific country.

    2. Category: This column might contain information about the type or category of the survey. It could include categories such as healthcare, economic impact, public sentiment, etc. This helps in categorizing the surveys.

    3. Title and Link: These columns may contain the title or name of the specific survey and a link to the source or webpage where more information about the survey can be found. The link can be useful for referencing the original source of the data.

    4. Description: This column likely contains a brief description or summary of the survey's objectives, methodology, or key findings. It provides additional context for the survey data.

    5. Source: This column may contain information about the organization or agency that conducted the survey. It's essential for understanding the authority behind the data.

    6. Date Added: This column probably contains the date when the survey data was added to the dataset. This helps track the freshness of the data and can be useful for historical analysis.

    With this dataset, you can perform various types of analysis, including but not limited to:

    • Country-based analysis: You can analyze survey data for specific countries to understand the impact of COVID-19 in different regions.

    • Category-based analysis: You can group surveys by category and analyze trends or patterns related to healthcare, economics, or public sentiment.

    • Temporal analysis: You can examine how survey data has evolved over time by using the "Date Added" column to track changes and trends.

    • Source-based analysis: You can assess the reliability and credibility of the data by considering the source of the surveys.

    • Data visualization: Create visual representations like charts, graphs, and maps to make the data more understandable and informative.

    Before conducting any analysis, it's essential to clean and preprocess the data, handle missing values, and ensure data consistency. Additionally, consider the research questions or insights you want to gain from the dataset, which will guide your analysis approach.

  8. QLFS

    • datacatalogue.ukdataservice.ac.uk
    Updated Sep 16, 2025
    + more versions
    Cite
    Office for National Statistics (2025). QLFS [Dataset]. http://doi.org/10.5255/UKDA-SN-9445-1
    Dataset updated
    Sep 16, 2025
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Office for National Statistics
    Area covered
    United Kingdom
    Description
    Background
    The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

    Household datasets
    Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. From January 2011, a pseudonymised household identifier variable (HSERIALP) is also included in the main quarterly LFS dataset instead.

    Change to coding of missing values for household series
    From 1996-2013, all missing values in the household datasets were set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. This was also in line with the Annual Population Survey household series of the time. The change was applied to the back series during 2010 to ensure continuity for analytical purposes. From 2013 onwards, the -8 and -9 categories have been reinstated.
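
    For analyses that span both coding regimes, one possible approach is to collapse the separate '-8' and '-9' codes into the combined '-10' code so that all periods share a single missing category. The sketch below assumes the codes are stored as integers and uses invented values. It is an illustration only, not ONS or UK Data Service guidance.

```python
COMBINED_MISSING = -10
SEPARATE_MISSING = {-8, -9}


def harmonise_missing(values):
    """Collapse the separate -8 and -9 codes into the combined -10 code."""
    return [COMBINED_MISSING if v in SEPARATE_MISSING else v for v in values]


# Hypothetical values from a period that uses the separate codes.
later_period = [1, -8, 2, -9, 3]
harmonised = harmonise_missing(later_period)
```

    The collapse only works in this direction: a '-10' value cannot be split back into '-8' or '-9' after the fact.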

    LFS Documentation
    The documentation available from the Archive to accompany LFS datasets largely consists of the latest version of each volume alongside the appropriate questionnaire for the year concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS LFS User Guidance page before commencing analysis.

    Additional data derived from the QLFS
    The Archive also holds further QLFS series: End User Licence (EUL) quarterly datasets; Secure Access datasets (see below); two-quarter and five-quarter longitudinal datasets; quarterly, annual and ad hoc module datasets compiled for Eurostat; and some additional annual Northern Ireland datasets.

    End User Licence and Secure Access QLFS Household datasets
    Users should note that there are two discrete versions of the QLFS household datasets. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. Secure Access household datasets for the QLFS are available from 2009 onwards, and include additional, detailed variables not included in the standard EUL versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits. For full details of variables included, see data dictionary documentation. The Secure Access version (see SN 7674) has more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements.

    Changes to variables in QLFS Household EUL datasets
    In order to further protect respondent confidentiality, ONS have made some changes to variables available in the EUL datasets. From July-September 2015 onwards, 4-digit industry class is available for main job only, meaning that 3-digit industry group is the most detailed level available for second and last job.

    Review of imputation methods for LFS Household data - changes to missing values
    A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method is primarily focused to ensure the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger amounts of missing values ('-8'/'-9') will be present in the data for these personal characteristic variables than before. Therefore if users need to carry out any time series analysis of households/families which also includes personal characteristic variables covering this time period, then it is advised to filter off 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.
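
    The recommended filter can be sketched as below, with each case held as a dictionary. The lower-case key ioutcome follows the spelling used in the text above; the exact variable name and casing in a given data file is an assumption, as are the sample rows.

```python
def drop_ioutcome_3(rows, key="ioutcome"):
    """Remove cases where ioutcome equals 3 (non-responders) from a period's rows."""
    return [row for row in rows if row.get(key) != 3]


# Hypothetical cases from one quarter; -9 is a missing-value code.
quarter = [
    {"person": 1, "ioutcome": 1, "ethnicity": 4},
    {"person": 2, "ioutcome": 3, "ethnicity": -9},
    {"person": 3, "ioutcome": 2, "ethnicity": 1},
]
kept = drop_ioutcome_3(quarter)
```

    Applying the same filter to every period in the series keeps the treatment of non-responders consistent before and after 2015.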

    Occupation data for 2021 and 2022 data files

    The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

  9. Young People Survey

    • kaggle.com
    zip
    Updated Dec 6, 2016
    Cite
    Miroslav Sabo (2016). Young People Survey [Dataset]. https://www.kaggle.com/miroslavsabo/young-people-survey
    Available download formats: zip (85769 bytes)
    Dataset updated
    Dec 6, 2016
    Authors
    Miroslav Sabo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction

    In 2013, students of the Statistics class at FSEV UK (https://fses.uniba.sk/en/) were asked to invite their friends to participate in this survey.

    • The data file (responses.csv) consists of 1010 rows and 150 columns (139 integer and 11 categorical).
    • For convenience, the original variable names were shortened in the data file. See the columns.csv file if you want to match the data with the original names.
    • The data contain missing values.
    • The survey was presented to participants in both electronic and written form.
    • The original questionnaire was in Slovak language and was later translated into English.
    • All participants were of Slovakian nationality, aged between 15-30.

    The variables can be split into the following groups:

    • Music preferences (19 items)
    • Movie preferences (12 items)
    • Hobbies & interests (32 items)
    • Phobias (10 items)
    • Health habits (3 items)
    • Personality traits, views on life, & opinions (57 items)
    • Spending habits (7 items)
    • Demographics (10 items)

    Research questions

    Many different techniques can be used to answer many questions, e.g.

    • Clustering: Given the music preferences, do people make up any clusters of similar behavior?
    • Hypothesis testing: Do women fear certain phenomena significantly more than men? Do left-handed people have different interests than right-handed people?
    • Predictive modeling: Can we predict spending habits of a person from his/her interests and movie or music preferences?
    • Dimension reduction: Can we describe a large number of human interests by a smaller number of latent concepts?
    • Correlation analysis: Are there any connections between music and movie preferences?
    • Visualization: How to effectively visualize a lot of variables in order to gain some meaningful insights from the data?
    • (Multivariate) Outlier detection: A small number of participants often cheat and answer the questions randomly. Can you identify them? Hint: local outlier factor may help.
    • Missing values analysis: Are there any patterns in missing responses? What is the optimal way of imputing the values in surveys?
    • Recommendations: If some of a user's interests are known, can we predict the others? Or, if we know what a person listens to, can we predict which kind of movies he/she might like?
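
    As a starting point for the missing values question in the list above, the sketch below counts missing cells per column and tallies which combinations of columns are missing together. It assumes that missing responses appear as empty cells in the CSV, and the three-column sample is invented; the real responses.csv has 150 columns.

```python
import csv
import io
from collections import Counter


def missing_summary(rows, columns):
    """Count missing cells per column and how often each missingness pattern occurs."""
    per_column = Counter()
    patterns = Counter()
    for row in rows:
        missing = tuple(col for col in columns if row[col] == "")
        per_column.update(missing)
        patterns[missing] += 1
    return per_column, patterns


# Hypothetical extract; empty cells stand for missing responses.
sample = io.StringIO("Music,Movies,Age\n5,4,20\n,4,21\n,,19\n3,,\n")
reader = csv.DictReader(sample)
per_column, patterns = missing_summary(list(reader), reader.fieldnames)
```

    Patterns that recur far more often than the per-column counts would suggest hint that the responses are not missing independently of each other.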

    Past research

    • (in Slovak) Sleziak, P. - Sabo, M.: Gender differences in the prevalence of specific phobias. Forum Statisticum Slovacum. 2014, Vol. 10, No. 6. [Differences (gender + whether people lived in village/town) in the prevalence of phobias.]

    • Sabo, Miroslav. Multivariate Statistical Methods with Applications. Diss. Slovak University of Technology in Bratislava, 2014. [Clustering of variables (music preferences, movie preferences, phobias) + Clustering of people w.r.t. their interests.]

    Questionnaire

    MUSIC PREFERENCES

    1. I enjoy listening to music.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    2. I prefer.: Slow paced music 1-2-3-4-5 Fast paced music (integer)
    3. Dance, Disco, Funk: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    4. Folk music: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    5. Country: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    6. Classical: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    7. Musicals: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    8. Pop: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    9. Rock: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    10. Metal, Hard rock: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    11. Punk: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    12. Hip hop, Rap: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    13. Reggae, Ska: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    14. Swing, Jazz: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    15. Rock n Roll: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    16. Alternative music: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    17. Latin: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    18. Techno, Trance: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    19. Opera: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)

    MOVIE PREFERENCES

    1. I really enjoy watching movies.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
    2. Horror movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    3. Thriller movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    4. Comedies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    5. Romantic movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    6. Sci-fi movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
    7. War movies: Don't enjoy at all 1-2-3-4-5 E...
  10. General Social Survey 2012 Cross-Section and Panel Combined - Instructional...

    • thearda.com
    Updated 2012
    Cite
    Tom W. Smith (2012). General Social Survey 2012 Cross-Section and Panel Combined - Instructional Dataset [Dataset]. http://doi.org/10.17605/OSF.IO/TH2CE
    Explore at:
    Dataset updated
    2012
    Dataset provided by
    Association of Religion Data Archives
    Authors
    Tom W. Smith
    Dataset funded by
    National Science Foundation
    Description

    This file contains all of the cases and variables that are in the original 2012 General Social Survey, but is prepared for easier use in the classroom. Changes have been made in two areas. First, to avoid confusion when constructing tables or interpreting basic analysis, all missing data codes have been set to system missing. Second, many of the continuous variables have been categorized into fewer categories, and added as additional variables to the file.

    The General Social Surveys (GSS) have been conducted by the National Opinion Research Center (NORC) annually since 1972, except for the years 1979, 1981, and 1992 (a supplement was added in 1992), and biennially beginning in 1994. The GSS are designed to be part of a program of social indicator research, replicating questionnaire items and wording in order to facilitate time-trend studies. This data file has all cases and variables asked on the 2012 GSS. There are a total of 4,820 cases in the data set but their initial sampling years vary because the GSS now contains panel cases. Sampling years can be identified with the variable SAMPTYPE.

    The 2012 GSS featured special modules on religious scriptures, the environment, dance and theater performances, health care system, government involvement, health concerns, emotional health, financial independence and income inequality.

    The GSS has switched from a repeating, cross-section design to a combined repeating cross-section and panel-component design. This file has a rolling panel design, with the 2008 GSS as the base year for the first panel. A sub-sample of 2,000 GSS cases from 2008 was selected for reinterview in 2010 and again in 2012 as part of the GSSs in those years. The 2010 GSS consisted of a new cross-section plus the reinterviews from 2008. The 2012 GSS consists of a new cross-section of 1,974, the first reinterview wave of the 2010 panel cases with 1,551 completed cases, and the second and final reinterview of the 2008 panel with 1,295 completed cases. Altogether, the 2012 GSS had 4,820 cases (1,974 in the new 2012 panel, 1,551 in the 2010 panel, and 1,295 in the 2008 panel).

    To download syntax files for the GSS that reproduce well-known religious group recodes, including RELTRAD, please visit ARDA's Syntax Repository (/research/syntax-repository-list).

  11. Palestinian Family Survey 2010 - West Bank and Gaza

    • pcbs.gov.ps
    Updated May 25, 2023
    Cite
    Palestinian Central Bureau of Statistics (2023). Palestinian Family Survey 2010 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/709
    Explore at:
    Dataset updated
    May 25, 2023
    Dataset authored and provided by
    Palestinian Central Bureau of Statistics (https://pcbs.gov/)
    Time period covered
    2010
    Area covered
    West Bank, Gaza, Gaza Strip
    Description

    Abstract

    The survey is designed to collect, analyze, and disseminate demographic and health data pertaining to the Palestinian population living in the Palestinian Territory, with a focus on demography, fertility, infertility, family planning, unmet needs, and maternal and child health, in addition to youth and the elderly. The 2010 survey includes new sections and elements, such as basic health and socio-economic information on different groups within the population: ever married woman less than 55 years and children aged less than five years, child labor in the age 5-14 years, child discipline 2-14 years, person education 5-24 years, youth aged 15-29 years, and elderly people over the age of 60.

    Geographic coverage

    The data are representative at the region level (West Bank, Gaza Strip), by locality type (urban, rural, camp) and by governorate.

    Analysis unit

    Household, individual

    Universe

    The survey covered all Palestinian households usually resident in Palestine.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Target Population

    The target population of the survey consists of all the following groups:

    1. All Palestinian households normally residing in the Palestinian Territory.
    2. Females aged 15-54 years.
    3. Elderly people aged 60 or over.
    4. Children aged 0-14 years, divided into the following categories: 0-5 years, 2-14 years, 5-14 years, with parts of the questionnaire customized for each group.
    5. Youth aged 15-29 years, divided into the following categories: 15-24 years, 25-29 years, with parts of the questionnaire customized for youth.

    Sampling Frame

    We relied on sampling frames established in PCBS and basically comprising the list of enumeration areas. (The enumeration area is a geographical area containing a number of buildings and housing units of about 120 housing units on average.)

    The total frame consists of the following two parts:

    1- West Bank and Gaza Sampling Frame: containing enumeration areas drawn up in 2007. In the West Bank: each enumeration area consists of a list of households with identification data to ascertain the address of individual households. In Gaza: each enumeration area contains a list of housing units with addresses to ascertain the address of individual households, plus identification data of the housing units.

    2- Jerusalem Sampling Frame (inside checkpoints): contains enumeration areas only, geographically divided with information about the total number of households in these areas. However, there is no detailed information about addresses inside enumeration areas and the size of the enumeration area can be ascertained without the ability to identify the addresses.

    Design Strata

    In the survey, two variables were chosen to divide the population into strata, depending on the homogeneity of parts of the population. Previous studies have shown that Palestinian households may be divided as follows:

    1. Governorates: there are 16 governorates in the Palestinian Territory: 11 governorates in the West Bank and 5 in the Gaza Strip.
    2. Locality types: there are three types: urban, rural and refugee camps.

    All the available frames contain the strata variables.

    Sample Size

    We use the following formula to estimate the sample size:

    $$ n = \frac{4 \, r \, (1 - r) \, f \, (1.15)}{(0.07\, r)^2 \, p \, n_h} $$

    Where:

    - n is the sample size requested for the main indicator or main estimate
    - 4 is a factor to achieve a 95 percent level of confidence
    - r is the predicted or anticipated prevalence (coverage rate) for the indicator being estimated
    - 1.15 is the factor necessary to raise the sample size by 20 percent for non-response
    - f is the design effect
    - 0.07r is the margin of error to be tolerated at the 95 percent level of confidence, defined as 7 percent of r (7 percent represents the relative sampling error of r)
    - p is the proportion of the total population upon which the indicator, r, is based
    - n_h is the average household size

    To estimate the sample size of the survey we rely on the percentage of children under 5 years who suffer from stunting. We consider it as the main indicator for the survey (r) and it equals 10.2% (from MICS3 data -2006). Also, by returning to census data in 2007 we find the percentage of children aged 0 - 4 years =14.1%. Finally, the sample size = 15,355
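
    The calculation can be sketched in a few lines, reading the formula as n = 4 · r · (1 − r) · f · 1.15 / ((0.07r)² · p · nh). The values of r and p are the ones quoted above; the design effect f and the average household size nh are not reported in this description, so the values used for them below are placeholders chosen only for illustration, and the result is not expected to reproduce the published figure.

```python
def required_sample_size(r, p, f, nh, nonresponse_factor=1.15, rel_margin=0.07):
    """Households needed so that r is estimated within rel_margin * r.

    r  : anticipated prevalence of the main indicator
    p  : share of the total population on which the indicator is based
    f  : design effect
    nh : average household size
    """
    numerator = 4 * r * (1 - r) * f * nonresponse_factor
    denominator = (rel_margin * r) ** 2 * p * nh
    return numerator / denominator


# r and p come from the text; f and nh are placeholder values, since the
# description does not report them.
n = required_sample_size(r=0.102, p=0.141, f=1.5, nh=6.0)
print(round(n))
```

    Because the margin of error enters squared, halving the tolerated relative error quadruples the required sample, while the design effect scales it linearly.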

    Sample Design and Type

    After determining the sample size, which equals 15,456 households, we selected a probability sample, a multi-stage stratified cluster sample, as follows:

    1. First stage: selecting a sample of clusters (enumeration areas) using the PPS without replacement method to obtain 644 enumeration areas from the total enumeration area frame.
    2. Second stage: selecting 24 households from each enumeration area selected in the first stage, using the systematic sample method. When reaching households, we enumerate all the targeted individuals from the groups: women (15-54 years), elderly aged 60 and more, children aged 0-5 years.
    3. Third stage: selecting one child of age group 2-14 years for part of the questionnaire and one young person from the 15-29 age group to answer the youth attachment in the questionnaire. We use the Kish table to select one person at random.
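
    The second-stage selection can be illustrated with a minimal sketch of systematic sampling with a random start. The household identifiers and the listing of 120 units are hypothetical (120 is the average enumeration area size mentioned above), and the function name is an illustration rather than anything used by PCBS.

```python
import random


def systematic_sample(units, k, rng):
    """Pick k units at a fixed interval after a random start."""
    if k > len(units):
        raise ValueError("cannot select more units than are listed")
    interval = len(units) / k
    start = rng.random() * interval
    # Step through the list at a constant interval from the random start.
    return [units[int(start + i * interval)] for i in range(k)]


rng = random.Random(2010)  # fixed seed so the sketch is repeatable
# Hypothetical enumeration area with 120 listed housing units.
households = [f"HH-{i:03d}" for i in range(1, 121)]
selected = systematic_sample(households, 24, rng)

# 644 enumeration areas x 24 households each gives the stated sample size.
total_households = 644 * 24
print(len(selected), total_households)
```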

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The design of the survey complied with the standard specifications of health surveys previously implemented by PCBS. In addition, the survey included indicators of MICS4 to meet the needs of all partners.

        1.  Main questionnaire with the following parts:
         · Household questionnaire: Covers demographic and educational characteristics, chronic disease, smoking, discipline of children (2-14 years), child labor (5-14 years), education of children (5-24 years) and housing characteristics.
    
         · Health of women (15-54 years) regardless of marital status, awareness about AIDS, anemia in women (15-49 years).
    
         · Ever married women (15-54 years): Covers general characteristics of qualified women, reproduction, child mortality, maternal care, reproductive morbidity, family planning, and attitudes towards reproduction.
    
         · Children under age of 5: Covers children's health, vaccination against childhood diseases, early childhood development, chronic disease, and anemia.
    
    
     2. Attached questionnaires
        ·  Youth questionnaire (15-29 years): Covers general characteristics, awareness and perception of family planning, health status, awareness about sexually transmitted diseases and reproduction.
    
        ·  Elderly questionnaire (60 years and over): Covers general characteristics, social relations, activities, time-use, health status, and use of mass media.
    

    Cleaning operations

    Data editing took place at a number of stages through the processing including:

    1. office editing and coding
    2. during data entry
    3. structure checking and completeness
    4. structural checking of SPSS data files

    Response rate

    The survey sample consisted of about 15,355 households, of which 13,629 completed the interview: 8,740 households in the West Bank and 4,889 households in the Gaza Strip. Weights were modified to account for the non-response rate. The response rate in the West Bank reached 90.5% while in the Gaza Strip it reached 94.8%. The response rate in the Palestinian Territory reached 92.0%.

    Sampling error estimates

    Detailed information on the sampling Error is available in the Survey Report.

    Data appraisal

    Different methods were applied in the assessment of the survey data, including: 1. Checking occurrences of missing values and answers like "other" and "do not know". 2. Examining inconsistencies between the various sections of the questionnaire, including within-record and cross-record consistencies. 3. Comparing the data with the previous surveys of 2000 and 2006, which showed logical homogeneity in the results. The results of these assessment procedures show that the data are of high quality and consistency.

  12. Household Income and Expenditure Survey 2005 - Micronesia

    • microdata.pacificdata.org
    Updated Aug 18, 2013
    Cite
    FSM Division of Statistics (2013). Household Income and Expenditure Survey 2005 - Micronesia [Dataset]. https://microdata.pacificdata.org/index.php/catalog/19
    Explore at:
    Dataset updated
    Aug 18, 2013
    Dataset authored and provided by
    FSM Division of Statistics
    Time period covered
    2005
    Area covered
    Micronesia
    Description

    Abstract

    The purpose of the HIES survey is to obtain information on the income, consumption pattern, incidence of poverty, and saving propensities for different groups of people in FSM. This information will be used to guide policy makers in framing socio-economic developmental policies and in initiating financial measures for improving economic conditions of the people. The 2005 FSM HIES asked income of all persons 15 years and over. It referred to income received during the calendar year 2004, and includes both cash and in-kind income. The survey has five primary objectives, namely to:

    1) Rebase the FSM Consumer Price Index (CPI); 2) Provide data on the distribution of income and expenditures throughout the FSM; 3) Provide data for national accounts, particularly regarding income from home production activities and the consumption of goods and services derived from home production activities; 4) Provide nutritional information and food consumption patterns for FSM families; and 5) Provide data for a hardship study.

    Geographic coverage

    Entire Country

    Four states of the FSM: Yap, Chuuk, Pohnpei, and Kosrae

    Analysis unit

    • Households
    • Individuals
    • Expenditure items

    Universe

    The survey universe covered all persons living in their place of usual residence at the time of the survey. Income data were collected from persons aged 15 years and over while expenditure data were obtained from all household members at a household level. Persons living in institutions, such as school dormitories, hospital wards, hostels, prisons, as well as those whose usual residence were somewhere else were excluded from the survey.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2005 FSM Household Income and Expenditure Survey (HIES) used a sampling frame based on updated information on Enumeration Districts (ED) and household listing from the 2000 FSM Census. Based on this sampling frame, the four states of FSM were then classified as the domains of the survey. Each of the states was further divided into 3 strata, except for Kosrae which was not divided at all because it doesn't possess any outer islands and it has relatively good access to goods and services. The entire island was therefore classified under stratum 1. Each stratum was defined as follows:

    1) State center and immediate surrounding areas:
    - High 'living standard' and has immediate access to goods and services.

    2) Areas surrounding state center (rest of main island):
    - Medium 'living standard' and sometimes limited access to goods and services.

    3) Outer islands:
    - Low 'living standard' and rare access to goods and services.

    Within each stratum, the HIES used a two-stage stratified sampling approach from which the sample was selected independently. First, enumeration districts (EDs) were drawn from each stratum using Probability Proportional to Size (PPS) sampling. Thus, the larger the ED size, the higher its probability of selection. About 69 EDs out of a total of 373 EDs were selected nationwide for the survey. Generally, one enumerator was assigned to each ED. Second, 20 households were systematically selected from an updated household listing for each of the selected EDs using a random start, giving a total sample size of 1,380 households, or roughly 8.4 percent of all households in the nation. Although it offered a fairly good representation of the total households in the nation, the final sample size showed a reduction of 180 households from the 1,560 households, or 10 percent, initially selected for the survey.
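
    The first-stage draw can be sketched as a systematic PPS selection over cumulative ED sizes. This is only one common way to implement PPS, and the description does not say which variant was used. The ED household counts below are randomly generated placeholders, and the sketch ignores stratification by drawing all 69 EDs from a single list.

```python
import random
from itertools import accumulate


def pps_systematic(sizes, n_select, rng):
    """Return indices of units chosen with probability proportional to size."""
    interval = sum(sizes) / n_select
    start = rng.random() * interval
    cumulative = list(accumulate(sizes))
    chosen, unit = [], 0
    for i in range(n_select):
        point = start + i * interval
        # Advance to the unit whose cumulative size range contains the point.
        while unit < len(sizes) - 1 and cumulative[unit] <= point:
            unit += 1
        chosen.append(unit)
    return chosen


rng = random.Random(2005)  # fixed seed so the sketch is repeatable
# Hypothetical household counts for 373 EDs (not the real frame).
ed_sizes = [rng.randint(20, 80) for _ in range(373)]
selected_eds = pps_systematic(ed_sizes, 69, rng)

# 69 EDs x 20 households each gives the final sample size.
total_households = len(selected_eds) * 20
print(len(set(selected_eds)), total_households)
```

    An ED larger than the sampling interval could be hit more than once by this method; the placeholder sizes are kept below the interval so that each selected ED is distinct.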

    Detailed information on the changes made to the sample size can be found in the next section under "Deviations from Sample Design."

    Sampling deviation

    The original plan to sample 1,560 households, or about 9.5 percent of all households in the nation was eventually reduced to 1,380 households, or about 8.4 percent of all households. The reduction of the sample size was due to fuel unavailability for transportation and uncertainty of field trip schedules to some of the selected outer islands. Dropping some of these islands from the sample was not expected to impact significantly on the accuracy of the survey results because independent weighting took place within each stratum, where islands were considered to be sufficiently homogenous.

    Mode of data collection

    Other [oth]

    Research instrument

    Questionnaires and forms used for the 2005 FSM HIES consisted of 1) the HIES Questionnaire and 2) Weekly Diaries. The HIES Questionnaire was provided to enumerators and was to be filled out during the first visit to the household. Its main objective was to collect housing information, basic demographic information about members of the household, and general household expenditures over the previous year. The weekly diaries, on the other hand, were an attempt to record household expenditure on a daily basis over the course of a 2-week period. Both the HIES questionnaire and the weekly diary were developed and modeled after similar forms from the 1998 FSM HIES Survey and the 2004 Palau HIES Survey. Dr. Micheal Levin from the US Census Bureau, International Program Center (IPC), Ms. Brihmer Johnson of the FSM Division of Statistics and Mr. Glenn McKinlay, statistics advisor to the FSM Division of Statistics, provided crucial inputs to the overall design of these forms. All questionnaires and diaries used during the HIES were printed in English, so it was extremely important that field interviewers understood the instructions and questions contained within. Testing of the questionnaire was carried out by FSM Division of Statistics staff, who conducted "real" interviews with certain households in their neighborhood and had their own households interviewed by a different staff member. Specific sections for both the HIES questionnaire and the weekly diaries are outlined below:

    I. HIES Questionnaire

    1) General Household characteristics 2 ) Individual Person Characteristics 3) General Expenditure Listings - 12 Months Recall Period

    II. 2 Week Daily Diaries

    1) Daily Expenditure Diary - Day1 (Mon) thru Day7 (Sun) 2) Home Produced Items 3) Gifts Given Away 4) Gifts Received 5) Unusual Expenses for Special Events

    Cleaning operations

    Data editing of the 2005 FSM HIES data occurred at several points during the data processing phase of the project and afterwards, prior to putting together the final report. After a two-week office review and callbacks right after the enumeration phase, the initial phase of data editing took place on July 18, 2005, when the data processing phase of the survey commenced. Training for editing and coding took place on the same day, along with the signing of contracts for 10 office clerks recruited to carry out this phase of the survey. As part of their contract, these individuals were also hired to key in the data at a later time. One of their primary responsibilities was to match geographic IDs for each questionnaire with the corresponding diaries and to ensure consistent and valid entries. No computer consistency edit checks were run against the data during the keying/verification process since the programs for these processes were not available at the time. All data quality checks and edits were done at the US Bureau of Census. Further edits were applied to the data during the data analysis and report writing process.

    There were five types of checks performed: structural check, verification check, consistency check, macro editing check, and data quality assessment. Edit lists were also produced for the health module and the income and expenditure questionnaire, which needed to be checked against the questionnaires. On the edit list, errors were corrected by crossing out incorrect or missing values and entering the correct values in red. Amounts that were also missing on the questionnaire were to be estimated using values from questionnaires in the same Enumeration District (ED) batch. For the diaries, the batch files were concatenated for each state and exported to tab-delimited files. These files were imported into Excel and the unit price for each item was calculated using quantities and weights where possible. Records for each item were then filtered and checked for outlier unit price values (both large and small values as well as missing values). Values for missing amounts were imputed using average prices of the items within the same ED.
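
    The last step, filling missing values from average prices within the same ED, can be illustrated with a small sketch. The original work was done in Excel; the record layout, field names and prices below are hypothetical and serve only to show the grouping logic.

```python
from collections import defaultdict
from statistics import mean


def impute_unit_prices(records):
    """Fill missing unit prices with the mean price of the same item in the same ED."""
    observed = defaultdict(list)
    for rec in records:
        if rec["unit_price"] is not None:
            observed[(rec["ed"], rec["item"])].append(rec["unit_price"])

    filled = []
    for rec in records:
        price = rec["unit_price"]
        key = (rec["ed"], rec["item"])
        # Leave the gap in place when the ED has no observed price for the item.
        if price is None and observed[key]:
            price = mean(observed[key])
        filled.append({**rec, "unit_price": price})
    return filled


# Hypothetical diary records: ED code, item and unit price (None = missing).
diary = [
    {"ed": "A", "item": "rice", "unit_price": 1.0},
    {"ed": "A", "item": "rice", "unit_price": 1.5},
    {"ed": "A", "item": "rice", "unit_price": None},
    {"ed": "B", "item": "rice", "unit_price": None},
]
filled = impute_unit_prices(diary)
print([rec["unit_price"] for rec in filled])
```

    In this sketch the third record receives the ED "A" average, while the ED "B" record stays missing because no price for that item was observed there.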

    The office operations manual used for editing and coding the questionnaires and diaries is provided under "Technical Documents/Data Processing Documents/Office Editing & Coding."

    Response rate

    Original Sample Size: 1,560 Households Original Sampling fraction 9.5%

    Final Sample Size: 1,380 Households Final Sampling fraction 8.4%

    The response rate for the final sample size of 1,380 households is 100 percent. The majority of households originally selected for the survey did respond to the survey. Households which have moved to other unselected areas or elsewhere and those who refused to respond were replaced with nearby households that were willing to participate in the survey.

    Sampling error estimates

    No sampling error analysis of the survey was calculated.

    Data appraisal

    The questionnaire design of the 2005 HIES varies from that of the 1998 HIES, which limits comparison of the 1998 data with the 2005 HIES. However, where the data permit, comparisons were made.

  13. General Social Survey 2014 Cross-Section and Panel Combined - Instructional...

    • thearda.com
    Updated 2014
    Cite
    Tom W. Smith (2014). General Social Survey 2014 Cross-Section and Panel Combined - Instructional Dataset [Dataset]. http://doi.org/10.17605/OSF.IO/ZFRD2
    Explore at:
    Dataset updated
    2014
    Dataset provided by
    Association of Religion Data Archives
    Authors
    Tom W. Smith
    Dataset funded by
    National Science Foundation
    Description

    This file contains all of the cases and variables that are in the original 2014 General Social Survey, but is prepared for easier use in the classroom. Changes have been made in two areas. First, to avoid confusion when constructing tables or interpreting basic analysis, all missing data codes have been set to system missing. Second, many of the continuous variables have been categorized into fewer categories, and added as additional variables to the file.

    The General Social Surveys (GSS) have been conducted by the National Opinion Research Center (NORC) annually since 1972, except for the years 1979, 1981, and 1992 (a supplement was added in 1992), and biennially beginning in 1994. The GSS are designed to be part of a program of social indicator research, replicating questionnaire items and wording in order to facilitate time-trend studies. This data file has all cases and variables asked on the 2014 GSS. There are a total of 3,842 cases in the data set but their initial sampling years vary because the GSS now contains panel cases. Sampling years can be identified with the variable SAMPTYPE.

    To download syntax files for the GSS that reproduce well-known religious group recodes, including RELTRAD, please visit ARDA's Syntax Repository (/research/syntax-repository-list).

  14. General Household Survey 2003 - South Africa

    • datafirst.uct.ac.za
    Updated Oct 22, 2020
    Cite
    Statistics South Africa (2020). General Household Survey 2003 - South Africa [Dataset]. https://www.datafirst.uct.ac.za/dataportal/index.php/catalog/86
    Explore at:
    Dataset updated
    Oct 22, 2020
    Dataset authored and provided by
    Statistics South Africa (http://www.statssa.gov.za/)
    Time period covered
    2003
    Area covered
    South Africa
    Description

    Abstract

    The GHS is an annual household survey specifically designed to measure the living circumstances of South African households. The GHS collects data on education, employment, health, housing and household access to services. GHS is designed to measure the level of development and performance of various government programmes and projects.

    Geographic coverage

    The survey is representative at national level and at provincial level.

    Analysis unit

    Households and individuals

    Universe

    The survey covered all de jure household members (usual residents) of households in the nine provinces of South Africa and residents in workers' hostels. The survey does not cover collective living quarters such as students' hostels, old age homes, hospitals, prisons and military barracks.

    Kind of data

    Sample survey data

    Sampling procedure

    The sample is multi-stage stratified using probability proportional to size principles. The first stage is stratification by province, then by type of area within each province. Primary sampling units (PSUs) are then selected proportionally within each stratum (urban or non-urban) in all provinces.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    GHS uses questionnaires as data collection instruments

    Data appraisal

    Earlier versions of the GHS datasets 2002 to 2007 include a District Council variable. This is no longer available in the later versions issued by Statistics SA. They caution that although the GHS 2005-2007 sample was designed to report at DC level, estimations are not reliable at this level. The 2008 - 2013 sample was designed to report at provincial and metro level. However, StatsSA did not take the absent population at metro into account when weighting the data and therefore this data is not reliable at Metro level.

    The new programs that were introduced for weighting of the general household surveys from 2008 onwards, discard all records with missing values for age, sex or population group (for observations at household level, they are the values for age, sex or population group of the household head). This means that missing values of those variables were imputed. The emphasis was on obtaining reliable imputations rather than a 100% imputation rate, so some persons/households were discarded during the weighting.

  15. National Panel Survey 2010-2011 - Tanzania

    • datacatalog.ihsn.org
    • catalog.ihsn.org
    Updated Mar 29, 2019
    Cite
    National Bureau of Statistics (2019). National Panel Survey 2010-2011 - Tanzania [Dataset]. https://datacatalog.ihsn.org/catalog/4617
    Explore at:
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    National Bureau of Statistics
    Time period covered
    2010 - 2011
    Area covered
    Tanzania
    Description

    Abstract

    The main objective of the TZNPS is to provide high-quality household-level data to the Tanzanian government and other stakeholders for monitoring poverty dynamics, tracking the progress of the Mkukuta poverty reduction strategy, and evaluating the impact of other major, national-level government policy initiatives. As an integrated survey covering a number of different socioeconomic factors, it complements other more narrowly focused survey efforts, such as the Demographic and Health Survey on health, the Integrated Labour Force Survey on labour markets, the Household Budget Survey on expenditure, and the National Sample Census of Agriculture. Secondly, as a panel household survey in which the same households are revisited over time, the TZNPS allows for the study of poverty and welfare transitions and the determinants of living standard changes.

    Geographic coverage

    National Coverage: Dar es Salaam, other urban areas in Mainland, rural areas in Mainland, and Zanzibar

    Analysis unit

    A living standards survey with community-level questionnaire with the following units of analysis: individuals, household, and communities.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample design for the second round of the NPS revisits all the households interviewed in the first round of the panel, as well as tracking adult split-off household members. The original sample of 3,265 households was designed to be representative at the national, urban/rural, and major agro-ecological zone levels. The total sample size was 3,265 households in 409 Enumeration Areas (2,063 households in rural areas and 1,202 in urban areas). It is also possible in the final analysis to produce disaggregated poverty rates for 4 different strata: Dar es Salaam, other urban areas on mainland Tanzania, rural mainland Tanzania, and Zanzibar.

    Since the TZNPS is a panel survey, the second round of the fieldwork revisits all households originally interviewed during round one. If a household had moved from its original location, the members were interviewed in their new location. If that location was within one hour of the original location, the field team did the interview at the time of their visit to the enumeration area. If the household had relocated more than an hour from the original location, details of the new location were recorded on specialized forms, and the information was passed to a dedicated tracking team for follow-up.

    If a member of the original household had split from their original location to form or join a new household, information was recorded on the current whereabouts of this member. All adult former household members (those over the age of 15) were tracked to their new location. Similar to the protocol for the re-located households, if the new household is within one hour of the original location, the new household was interviewed by the main field team at the time of the visit to the enumeration area. For those that have moved more than one hour away, their information was passed to the dedicated tracking team for follow-up. Once the tracking targets have been found, teams are required to interview them and any new members of the household.

    The second round of the NPS has a total sample size of 3,924 households. This includes 3,168 round-one households, a re-interview rate of over 97 percent. In addition, of the 10,420 eligible adults (over age 15 in 2010), 9,338 were re-interviewed, a re-interview rate of approximately 90 percent.

    Sampling deviation

    To obtain the attrition adjustment factor, the probability that a sample household was successfully reinterviewed in the second round of surveys is modeled with a linear logistic model at the level of the individual. A binary response variable is created by coding the response disposition for eligible households that do not respond in the second round as 0, and households that do respond as 1. Then a logistic response propensity model is fitted, using 2005 UNHS household and individual characteristics measured in the first wave as covariates.

    In a few limited cases, values of unit level variables were missing from the 2008/2009 household dataset. These values were imputed using multivariate regression and logistic regression techniques. Imputations are done using the ‘impute’ command in Stata at the level of the UNPS strata (urban/rural and region). Overall, less than one percent of the variables required imputation to replace missing values.

    The estimated logistic model is used to obtain a predicted probability of response for each household member in the 2010/2011 survey. These response probabilities were then aggregated to the household level (by calculating the mean). Then, using the household-level predicted response probabilities as the ranking variable, all households were ranked into 10 equal groups (deciles). An attrition adjustment factor was then defined as the reciprocal of the empirical response rate for the household-level propensity score decile.
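
    The decile-based adjustment described above can be sketched as follows. This is a minimal illustration only, not the survey team's actual code: the function name, the input layout (`member_probs`, `responded`) and the handling of a group with no respondents are assumptions made for the example.

```python
from statistics import mean

def attrition_factors(households, n_groups=10):
    """Return one attrition adjustment factor per household, in input order.

    households: list of dicts with (hypothetical) keys
      'member_probs' - predicted response probabilities of the members
      'responded'    - 1 if the household was re-interviewed, else 0
    """
    # Aggregate member-level probabilities to the household level (mean).
    scores = [mean(h["member_probs"]) for h in households]

    # Rank households by score and cut the ranking into n_groups groups.
    order = sorted(range(len(households)), key=lambda i: scores[i])
    group = [0] * len(households)
    for rank, i in enumerate(order):
        group[i] = rank * n_groups // len(households)

    # Factor = reciprocal of the empirical response rate within each group.
    factor_by_group = {}
    for g in set(group):
        responses = [households[i]["responded"]
                     for i in range(len(households)) if group[i] == g]
        rate = mean(responses)
        # Assumption: leave the factor undefined if nobody responded.
        factor_by_group[g] = 1.0 / rate if rate > 0 else None

    return [factor_by_group[g] for g in group]
```

    With real data `n_groups` would stay at 10 (deciles); a smaller value is only useful for checking the logic on a handful of records.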

    A post-stratification correction is applied to weight the population totals up to the known population figures and to reduce the overall standard errors (see Little et al, 1997). Adjustment factors are calculated based on the projected number of households in the urban and rural segments of each region.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The Household Questionnaire is composed of thematic sections. This comprehensive questionnaire allows for the construction of a full consumption-based welfare measure, permitting distributional and incidence analysis. This project also recognizes the imperative to look beyond the household as a unit of analysis in order to improve the quality, relevance and sustainability of agricultural data systems. Although data collection is structured around a household panel survey, the data on labor, education, and health status were collected at the individual level. Moreover, in some household activities (like non-farm enterprise), the questionnaire records which specific members are engaged in the activity. A detailed description of the contents of the questionnaire can be found in the Basic Information Document report (Table 1).

    The Agricultural Questionnaire collects information relative to a household’s agricultural activities. Information is collected at both the plot and crop level on inputs, production and sales. The Basic Information Document report (Table 2) provides a detailed description of the contents of the questionnaire. This questionnaire was administered to any household that engaged in any farming or livestock holding.

    The Fisheries Questionnaire was developed in partnership with the World Fish Program to collect data on household fishery activities, fish processing, and fish trading. This includes data on the inputs, outputs, labour, and sales. All this data is divided into two reference periods, the high and low season. This data is collected at the household level. The Basic Information Document report (Table 3) provides a more comprehensive list of the sections found within the Fishery Questionnaire.

    The Community Questionnaire collects information on physical and economic infrastructure and events in surveyed communities. In each selected survey community, key informants are interviewed by the field team supervisors. Information about the respondents for the community questionnaire is collected individually in section CI of the community questionnaire.

    The questionnaires were developed in collaboration with line ministries and donor partners, including the Technical Committee, over a period of several months. The NBS solicited feedback from various stakeholders with regard to survey content and design. The round two questionnaires were piloted in the Morogoro region in June 2010, in conjunction with supervisor training. After piloting, the questionnaires were further revised and finalized by August 2010. Questionnaire manuals were developed with detailed instructions for field staff during training and as the main survey reference guide over the course of the field work.

    Cleaning operations

    A CSPro-based data entry/editing system was used.

    A cross comparison between the entered values in the field based data entry and double entry was conducted and any differences in values between the two were flagged for manual inspection of the physical questionnaire. Corrections based on this inspection exercise were ultimately encoded in the dataset.

    Additionally, an extensive review of data files was conducted, including interviewer errors such as missing values, ranges and outliers. Observations were returned for manual inspection of the physical questionnaires if continuous values fell outside five standard deviations of the mean, categorical values were not eligible responses, or there were internal inconsistencies within the dataset (for example, the age of an individual was not consistent with their educational status, there was more than one head of household listed, an individual was engaged in multiple primary activities, the quantity of crops and their byproducts produced, harvested, and sold not listed, the distance from the market and an individual’s plot was not listed, the number of weeks, days per week, and hours per day an individual engaged in fishery activity was not recorded, the species and quantity of fish caught, bought, sold, or traded was not listed, etc). When it was determined that these values were the result of data-entry error, the values were corrected. In addition, cases deemed to reflect obvious enumerator error were also

  16. National Household Income and Expenditure Survey 2009-2010 - Namibia

    • microdata.worldbank.org
    • catalog.ihsn.org
    • +1more
    Updated Apr 11, 2018
    Cite
    Namibia Statistics Agency (2018). National Household Income and Expenditure Survey 2009-2010 - Namibia [Dataset]. https://microdata.worldbank.org/index.php/catalog/1548
    Explore at:
    Dataset updated
    Apr 11, 2018
    Dataset authored and provided by
    Namibia Statistics Agency https://nsa.org.na/
    Time period covered
    2009 - 2010
    Area covered
    Namibia
    Description

    Abstract

    The Household Income and Expenditure Survey (NHIES) 2009 was a survey collecting data on income, consumption and expenditure patterns of households, in accordance with methodological principles of statistical enquiries, which were linked to demographic and socio-economic characteristics of households. A Household Income and Expenditure Survey was the sole source of information on expenditure, consumption and income patterns of households, which was used to calculate poverty and income distribution indicators. It also served as a statistical infrastructure for the compilation of the national basket of goods used to measure changes in price levels. It was also used for updating the national accounts.

    The main objective of the NHIES 2009-2010 was to comprehensively describe the levels of living of Namibians using actual patterns of consumption and income, as well as a range of other socio-economic indicators based on collected data. This survey was designed to inform policy making at the international, national and regional levels within the context of the Fourth National Development Plan, in support of monitoring and evaluation of Vision 2030 and the Millennium Development Goals (MDGs). The NHIES was designed to provide policy makers with reliable estimates at regional levels as well as to meet rural-urban disaggregation requirements.

    Geographic coverage

    National

    Analysis unit

    • Individuals
    • Households

    Universe

    In each week of the four-week period of a survey round, all persons in the household were asked whether they had spent at least 4 nights of that week in the household. Any person who spent at least 4 nights in the household was taken as having spent the whole week in the household. To qualify as a household member, a person must have stayed in the household for at least two weeks out of the four weeks.
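
    As an illustration, the membership rule can be written as a small check. The function name and the input format (one count of nights per week of the round) are assumptions made for this sketch; they do not come from the survey documentation.

```python
def is_household_member(nights_per_week):
    """Apply the four-week membership rule described above.

    nights_per_week: nights spent in the household in each of the
    four weeks of a survey round, e.g. [7, 4, 0, 3].
    """
    # A week counts as fully spent in the household if at least
    # 4 nights of that week were spent there.
    weeks_present = sum(1 for nights in nights_per_week if nights >= 4)
    # Membership requires at least two such weeks out of four.
    return weeks_present >= 2
```

    For example, a person with `[4, 0, 7, 3]` nights qualifies (two qualifying weeks), while `[3, 3, 3, 7]` does not (one qualifying week).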

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The target population of the NHIES 2009-2010 was the private households of Namibia. The population living in institutions, such as hospitals, hostels, police barracks and prisons, was not covered in the survey. However, private households residing within institutional settings were covered. The sample design for the survey was a stratified two-stage probability sample, where the first-stage units were geographical areas designated as Primary Sampling Units (PSUs) and the second-stage units were households. The PSUs were based on the 2001 Census EAs, and the list of PSUs served as the national sample frame. The urban part of the sample frame was updated to include the changes resulting from rural-to-urban migration and new developments in housing. The sample frame was stratified first by region and then by urban and rural areas within each region. In urban areas, further stratification was carried out by level of living, based on geographic location and housing characteristics. The first-stage units were selected from the sampling frame of PSUs, and the second-stage units were selected from a current list of households within each selected PSU, which was compiled just before the interviews.

    PSUs were selected using probability proportional to size sampling coupled with a systematic sampling procedure, where the size measure was the number of households within the PSU in the 2001 Population and Housing Census (PHC). The households were selected from the current list of households using a systematic sampling procedure.
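
    A generic sketch of systematic selection with probability proportional to size is given below. It illustrates the general technique only; the function name, the use of a fractional interval and the random start are assumptions of this example, not details taken from the NHIES documentation.

```python
import random

def pps_systematic(sizes, n_sample, rng=None):
    """Select n_sample units with probability proportional to size.

    sizes: size measure per unit (e.g. census household counts per PSU).
    Returns the indices of the selected units, in list order.
    """
    rng = rng or random.Random()
    total = sum(sizes)
    interval = total / n_sample          # sampling interval
    start = rng.random() * interval      # random start in [0, interval)

    selected = []
    cumulative = 0                       # total size of units already passed
    idx = 0
    for k in range(n_sample):
        target = start + k * interval
        # Walk down the list until the cumulative size covers the target.
        while idx < len(sizes) - 1 and cumulative + sizes[idx] <= target:
            cumulative += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected
```

    A unit whose size exceeds the interval can be hit more than once; in practice such units are usually handled separately (for example by splitting them or treating them as certainty selections).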

    The sample size was designed to achieve reliable estimates at the region level and for urban and rural areas within each region. However, the actual sample sizes in urban or rural areas within some of the regions may not satisfy the expected precision levels for certain characteristics. The final sample consists of 10 660 households in 533 PSUs. The selected PSUs were randomly allocated to the 13 survey rounds.

    Sampling deviation

    The full expected sample of 533 PSUs was covered. However, a number of originally selected PSUs had to be substituted by new ones for the following reasons.

    Urban areas: Movement of people for resettlement in informal settlement areas from one place to another caused a selected PSU to be empty of households.

    Rural areas: In addition to Caprivi region (where one constituency is generally flooded every year), Ohangwena and Oshana regions were badly affected by unusual flooding. Although this situation was generally addressed by interchanging the PSUs between survey rounds, some PSUs were still under water close to the end of the survey period.

    There were five empty PSUs in the urban areas of Hardap (1), Karas (3) and Omaheke (1) regions. Since these PSUs were found in the low strata within the urban areas of the relevant regions, the substituting PSUs were selected from the same strata. There were also five PSUs under water, in the rural areas of Caprivi (1), Ohangwena (2) and Oshana (2) regions. Wherever possible, the substituting PSUs were selected from the same constituency as the original PSU. If not, the selection was carried out from the rural stratum of the particular region.

    One sampled PSU in the urban area of Khomas region (Windhoek city) had grown so large that it had to be split into 7 PSUs. This was incorporated into the geographical information system (GIS) and one PSU out of the seven was selected for the survey. In one PSU in Erongo region only fourteen households were listed, and in one PSU in Omusati region only eleven households were listed. All these households were interviewed and no additional selection was done to cover for the loss in sample.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    As in the previous survey, the instruments for data collection were the questionnaires and manuals. The Form I questionnaire collected demographic and socio-economic information on household members, such as sex, age, education and employment status, among others. It also collected information on household possessions like animals, land, housing, household goods, utilities, household income and expenditure, etc.

    Form II or the Daily Record Book is a diary for recording daily household transactions. A book was administered to each sample household each week for four consecutive weeks (survey round). Households were asked to record transactions, item by item, for all expenditures and receipts, including incomes and gifts received or given out. Own produce items were also recorded. Prices of items from different outlets were also collected in both rural and urban areas. The price collection was needed to supplement information from areas where price collection for consumer price indices (CPI) does not currently take place.

    Cleaning operations

    The data capturing process was undertaken in the following ways: Form 1 was scanned, interpreted and verified using the “Scan”, “Interpret” & “Verify” modules of the Eyes & Hands software respectively. Some basic checks were carried out to ensure that each PSU was valid and every household was unique. Invalid characters were removed. The scanned and verified data was converted into text files using the “Transfer” module of the Eyes & Hands. Finally, the data was transferred to a SQL database for further processing, using the “TranScan” application. The Daily Record Books (DRB or form 2) were manually entered after the scanned data had been transferred to the SQL database. The reason was to ensure that all DRBs were linked to the correct Form 1, i.e. each household's Form 1 was linked to the corresponding Daily Record Book. In total, 10 645 questionnaires (Form 1), comprising around 500 questions each, were scanned and close to one million transactions from the Form 2 (DRBs) were manually captured.

    Response rate

    Household response rate: Total number of responding households and non-responding households and the reason for non-response are shown below. Non-contacts and incomplete forms, which were rejected due to a lot of missing data in the questionnaire, at 3.4 and 4.0 percent, respectively, formed the largest part of non-response. At the regional level Erongo, Khomas, and Kunene reported the lowest response rate and Caprivi and Kavango the highest.

    Data appraisal

    To allow comparison with the previous survey in 2003/2004 and to follow the development of the country, methodology and definitions were kept the same. Comparisons between the surveys can be found in the different chapters in this report. Experiences from the previous survey gave valuable input to this one, and the data collection was improved to avoid errors experienced earlier. Also, some additional questions in the questionnaire helped to confirm the accuracy of reported data. During the data cleaning process it turned out that some households had difficulty separating their household consumption from their business consumption when recording their daily transactions in the DRB. This applied in particular to guest farms, the number of which has increased considerably during the past five years. All households with extremely high consumption were examined manually, and business transactions were recorded and separated from private consumption.

  17. Social Survey of Jerusalem 2005 - West Bank and Gaza

    • pcbs.gov.ps
    Updated Nov 4, 2020
    Cite
    Palestinian Central Bureau of Statistics (2020). Social Survey of Jerusalem 2005 - West Bank and Gaza [Dataset]. https://www.pcbs.gov.ps/PCBS-Metadata-en-v5.2/index.php/catalog/431
    Explore at:
    Dataset updated
    Nov 4, 2020
    Dataset authored and provided by
    Palestinian Central Bureau of Statistics https://pcbs.gov/
    Time period covered
    2005
    Area covered
    West Bank, Gaza, Gaza Strip
    Description

    Abstract

    The Jerusalem Household Social Survey 2005 is one of the most important statistical activities conducted by PCBS. It is the most detailed and comprehensive statistical activity that PCBS has conducted in Jerusalem. The main objective of the Jerusalem Household Social Survey 2005 is to provide basic information about: demographic and social characteristics of the Palestinian society in Jerusalem Governorate, including age-sex structure; illiteracy rate, enrollment and drop-out rates by background characteristics; labor force status, unemployment rate, occupation, economic activity, employment status, place of work and wage levels; housing and housing conditions; living levels and the impact of Israeli measures on nutrition behavior during the Al-Aqsa intifada; and criminal offences, their victims, and injuries caused.

    Geographic coverage

    The survey data cover Jerusalem Governorate only, by type of locality (urban, rural, refugee camps).

    Analysis unit

    Households, individuals

    Universe

    The target population was all Palestinian households living in Jerusalem Governorate.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Sample frame: The estimated sample size for Jerusalem was 3,300 households, including 2,240 households in area J1 and 1,060 households in area J2. The sample frame for Jerusalem (J2) was established from the General Census of Population, Housing and Establishments, which was carried out by the PCBS at the end of 1997. The sample frame for Jerusalem (J1) was created from the data of an enumeration project carried out in 2004. The frame is a list of enumeration areas, which are used as primary sampling units (PSUs) in the first stage of sample selection.

    Sample design: a stratified cluster systematic random sample in two stages. Stage I: a stratified random sample of enumeration areas was selected from Jerusalem (J1) and Jerusalem (J2). A total of 123 enumeration areas were chosen: 70 in Jerusalem (J1) and 53 in Jerusalem (J2). Stage II: a systematic random sample of 20 households was drawn from each enumeration area selected in the first stage in Jerusalem J2, and of 32 households from each enumeration area selected in the first stage in Jerusalem J1.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The survey questionnaire was the main tool for gathering information, so it had to meet the technical specifications of the fieldwork phase as well as the requirements of data processing and analysis. The questionnaire was designed after examining the experience of other countries with social surveys. It covers, as far as possible, the most important social indicators recommended by the United Nations, taking into account the specificity of the Palestinian community in this respect.

    Cleaning operations

    The data processing phase included a set of activities and operations carried out on the questionnaires to prepare them for the analysis phase. This phase included the following operations:

    Editing before data entry: at this stage all questionnaires were checked using the editing instructions to make sure the data were logical, and incomplete questionnaires were returned to the field a second time.

    Data entry: data entry was centralized at the headquarters in Al-Bireh and was organized using the BLAISE program, in which the questionnaire was programmed. The data entry program had the following properties and features:

    • The ability to work with an exact copy of the questionnaire on the computer screen.
    • The ability to carry out all possible checks on the logic and sequence of the data in the questionnaire.
    • Keeping data entry errors and fieldwork errors to a minimum.
    • Ease of use of the software and data (user-friendly).
    • The ability to convert the data to other formats for use and analysis in statistical systems such as SPSS.

    Response rate

    During the fieldwork, 3,300 households were visited in Jerusalem Governorate: 2,240 in area J1 and 1,060 in area J2. The final results of the interviews were as follows: 2,485 households in Jerusalem Governorate completed the questionnaire (75.3%), of which 1,773 were in J1 (79.2%) and 712 in J2 (67.2%).
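
    The percentages above follow directly from the counts of visited and completed interviews. A short check, with variable names chosen for this example only:

```python
# Counts as reported in the response-rate paragraph above.
visited = {"J1": 2240, "J2": 1060}
completed = {"J1": 1773, "J2": 712}

def response_rate(done, total):
    """Completed interviews as a percentage of visits, to one decimal."""
    return round(100 * done / total, 1)

# Rate per area, and for the governorate as a whole.
rates = {area: response_rate(completed[area], visited[area])
         for area in visited}
overall = response_rate(sum(completed.values()), sum(visited.values()))

print(rates, overall)
```

    The totals also reconcile with the sample design: 1,773 + 712 = 2,485 completed interviews out of 2,240 + 1,060 = 3,300 visits.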

    Sampling error estimates

    Data were collected by sample survey and not by complete enumeration, so they are subject to two main types of errors: sampling errors (statistical errors) and non-statistical errors. Sampling errors are errors resulting from the sample design and are therefore easy to measure; the variance and the design effect were calculated.

    Non-statistical errors can occur at every stage of project implementation, during data collection and data entry. They can be summarized as non-response errors, response errors (respondent), interviewing errors (fieldworker) and data-entry errors. To avoid errors and reduce their impact, significant efforts were made: fieldworkers received extensive training, with a group of experts present, on concepts and terminology (medical / health), on how to conduct interviews, and on what must be done and what should be avoided during the interview.

    Data entry staff were trained on the data entry program, and the program was tested in order to get a picture of the situation and reduce any problems. There was constant contact between supervisors and editors through ongoing visits and periodic meetings. In addition, a set of circulars and reminder instructions was drafted for the team, and answers to the questions and problems faced by fieldworkers during the fieldwork were circulated.

    As for office work, staff were trained to edit the questionnaires and detect field errors, which greatly reduces the rate of errors that can occur during fieldwork. To reduce the proportion of errors that can occur when entering the questionnaires into the computer, the entry program was designed so as not to allow consistency errors during data entry, and it contains many logical conditions: the program was loaded with many checks on the answers to each question, on the relations between different questions, and other logical checks. This process revealed most of the errors that had not been found in previous phases of work, and all errors discovered were corrected.

    Data were evaluated in the following areas: 1. Definition of household members and how they were registered. 2. Demographic characteristics related to date of birth. 3. Classification of occupation and economic activity.

    Methods of assessment vary according to the subject of the data; in this survey they include the following: 1. Occurrences of missing values and of the answers "other" and "do not know", and examination of inconsistencies between different sections or between the date of birth and other sections, in addition to examining the internal consistency of the data as part of logical and completeness checks. 2. Comparison of the survey data with the results of related surveys implemented by the Palestinian Central Bureau of Statistics.

    Some sources of non-statistical errors that emerged during the implementation of the survey can be summarized as follows:

    • Inability to complete the questionnaire in some cases because no one was at home, the housing unit did not exist or was uninhabited, or households were unable to provide some data or refused to do so.
    • Some households did not take the questionnaire very seriously, which affected the quality of the data provided.
    • Errors resulting from the way the fieldworker asked the question in the field.
    • The respondent's understanding of the question, with the answer based on that understanding.
    • The inability of the technical team overseeing the project to visit all work sites in the field on a regular basis in order to see the workflow and to meet and direct fieldworkers, especially in area J1.
    • Difficulty in reaching households because of the construction of the wall, especially in the Ram area and in the Bir Nabala area, where an entire enumeration area was replaced because of additional incompleteness caused by the absence of households in the area due to the separation wall.
    • Difficulty in following up and controlling fieldworkers' time because of the prevailing security conditions.

  18. Levels of obesity, inactivity and associated illnesses (England): Missing...

    • hub.arcgis.com
    • data.catchmentbasedapproach.org
    Updated Apr 8, 2021
    Cite
    The Rivers Trust (2021). Levels of obesity, inactivity and associated illnesses (England): Missing data [Dataset]. https://hub.arcgis.com/datasets/theriverstrust::levels-of-obesity-inactivity-and-associated-illnesses-england-missing-data/explore
    Explore at:
    Dataset updated
    Apr 8, 2021
    Dataset authored and provided by
    The Rivers Trust
    Area covered
    Description

    SUMMARY

    To be viewed in combination with the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

    This dataset shows where there was no data* relating to one or more of the following factors:

    • Obesity/inactivity-related illnesses (recorded at the GP practice catchment area level*)
    • Adult obesity (recorded at the GP practice catchment area level*)
    • Inactivity in children (recorded at the district level)
    • Excess weight in children (recorded at the Middle Layer Super Output Area level)

    * GPs do not have catchments that are mutually exclusive from each other: they overlap, with some geographic areas being covered by 30+ practices.

    GP data for the financial year 1st April 2018 – 31st March 2019 was used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 was used instead.

    This dataset identifies areas where data from 2019/20 was used, where one or more GPs did not submit data in either year (this could be because there are rural areas that aren’t officially covered by any GP practices), or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further), and thus where data should be interpreted with caution.

    Results of the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ analysis in these areas should be interpreted with caution, particularly if the levels of obesity, inactivity and associated illnesses appear to be significantly lower than in their immediate surrounding areas.

    Really small areas with ‘missing’ data were deleted, where it was deemed that missing data will not have impacted the overall analysis (i.e. where GP data was missing from really small countryside areas where no people live).

    See also Health and wellbeing statistics (GP-level, England): Missing data and potential outliers data

    DATA SOURCES

    This dataset was produced using:

    - Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
    - National Child Measurement Programme: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
    - Active Lives Survey 2019: Sport and Physical Activity Levels amongst children and young people in school years 1-11 (aged 5-16). © Sport England 2020.
    - Active Lives Survey 2019: Sport and Physical Activity Levels amongst adults aged 16+. © Sport England 2020.
    - GP Catchment Outlines. Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data was cleaned by Ribble Rivers Trust before use.
    - Administrative boundaries: Boundary-Line™: Contains Ordnance Survey data © Crown copyright and database right 2021. Contains public sector information licensed under the Open Government Licence v3.0.
    - MSOA boundaries: © Office for National Statistics licensed under the Open Government Licence v3.0. Contains OS data © Crown copyright and database right 2021.

    COPYRIGHT NOTICE

    The reproduction of this data must be accompanied by the following statement:

    © Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital; © Sport England 2020; © Office for National Statistics licensed under the Open Government Licence v3.0. Contains Ordnance Survey data © Crown copyright and database right 2021. Contains public sector information licensed under the Open Government Licence v3.0.

    CaBA HEALTH & WELLBEING EVIDENCE BASE

    This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.

  19. Project for Statistics on Living Standards and Development 1993 - South...

    • microdata.fao.org
    • catalog.ihsn.org
    • +2more
    Updated Oct 20, 2020
    Cite
    Southern Africa Labour and Development Research Unit (2020). Project for Statistics on Living Standards and Development 1993 - South Africa [Dataset]. https://microdata.fao.org/index.php/catalog/1527
    Explore at:
    Dataset updated
    Oct 20, 2020
    Dataset authored and provided by
    Southern Africa Labour and Development Research Unit
    Time period covered
    1993
    Area covered
    South Africa
    Description

    Abstract

    The Project for Statistics on Living Standards and Development was a countrywide World Bank Living Standards Measurement Survey. It covered approximately 9000 households, drawn from a representative sample of South African households. The fieldwork was undertaken during the nine months leading up to the country's first democratic elections at the end of April 1994. The purpose of the survey was to collect statistical information about the conditions under which South Africans live in order to provide policymakers with the data necessary for planning strategies. This data would aid the implementation of goals such as those outlined in the Government of National Unity's Reconstruction and Development Programme.

    Geographic coverage

    National

    Analysis unit

    Households

    Universe

    All household members. Individuals in hospitals, old age homes, hotels, and hostels of educational institutions were not included in the sample. Migrant labour hostels were included: in addition to those that turned up in the selected ESDs, a sample of three hostels was chosen from a national list provided by the Human Sciences Research Council, and within each of these hostels a representative sample was drawn on a basis similar to that used for the households in ESDs.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    (a) SAMPLING DESIGN

    Sample size is 9,000 households. The sample design adopted for the study was a two-stage self-weighting design in which the first-stage units were Census Enumerator Subdistricts (ESDs, or their equivalent) and the second-stage units were households. The advantage of such a design is that it provides a representative sample that need not be based on an accurate census population distribution. In the case of South Africa, the sample will automatically include many poor people, without the need to go beyond this and oversample the poor. Proportionate sampling, as in such a self-weighting sample design, offers the simplest possible data files for further analysis, as weights do not have to be added. However, in the end this advantage could not be retained, and weights had to be added.

    (b) SAMPLE FRAME

    The sampling frame was drawn up on the basis of small, clearly demarcated area units, each with a population estimate. The nature of the self-weighting procedure adopted ensured, however, that this population estimate was not important for determining the final sample. For most of the country, census ESDs were used. Where some ESDs comprised relatively large populations, as for instance in some black townships such as Soweto, aerial photographs were used to divide the areas into blocks of approximately equal population size. In other instances, particularly in some of the former homelands, the area units were not ESDs but villages or village groups.

    In the sample design chosen, the area stage units (generally ESDs) were selected with probability proportional to size, based on the census population. Systematic sampling was used throughout, that is, sampling at a fixed interval in a list of ESDs, starting at a randomly selected starting point. Given that sampling was self-weighting, the impact of stratification was expected to be modest. The main objective was to ensure that the racial and geographic breakdown approximated the national population distribution. This was done by listing the area stage units (ESDs) by statistical region and then, within the statistical region, by urban or rural. Within these sub-statistical regions, the ESDs were then listed in order of percentage African. The sampling interval for the selection of the ESDs was obtained by dividing the 1991 census population of 38,120,853 by the 300 clusters to be selected. This yielded 105,800. Starting at a randomly selected point, every 105,800th person down the cluster list was selected. This ensured both geographic and racial diversity (ESDs were ordered by statistical sub-region and proportion of the population African). In three or four instances, the ESD chosen was judged inaccessible and replaced with a similar one.

    In the second sampling stage the unit of analysis was the household. In each selected ESD a listing or enumeration of households was carried out by means of a field operation. From the households listed in an ESD, a sample of households was selected by systematic sampling. Even though the ultimate enumeration unit was the household, in most cases "stands" were used as enumeration units. However, when a stand was chosen as the enumeration unit, all households on that stand had to be interviewed.
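
The first-stage selection described above (a fixed interval stepped down a cumulative population list from a random start) can be sketched roughly as follows. This is an illustration only, not the survey team's actual procedure: the function name `systematic_pps`, its parameters, and the toy sizes are assumptions, and the list of area units is assumed to be already ordered by region, urban or rural, and percentage African. The interval is computed from the sizes passed in rather than hard-coded.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def systematic_pps(sizes, n_clusters, rng=None):
    """Pick area units with probability proportional to size.

    sizes      -- population estimate of each area unit, in list order
    n_clusters -- number of selection points (clusters) to draw
    rng        -- optional random.Random instance, for reproducibility

    Returns the indices of the selected units. A unit larger than the
    sampling interval can be selected more than once.
    """
    rng = rng or random.Random()
    total = sum(sizes)
    interval = total / n_clusters          # the fixed sampling interval
    start = rng.uniform(0, interval)       # random starting point
    cumulative = list(accumulate(sizes))   # running population totals

    picks = []
    for k in range(n_clusters):
        point = start + k * interval
        # First unit whose cumulative population reaches the point.
        idx = bisect_left(cumulative, point)
        picks.append(min(idx, len(sizes) - 1))  # guard against float drift
    return picks


# Toy example: ten equally sized units, five clusters.
example_picks = systematic_pps([100] * 10, 5, rng=random.Random(0))
print(example_picks)
```

With equally sized units the selection reduces to taking every second unit; with unequal sizes, larger units cover more of the cumulative list and are proportionally more likely to contain a selection point.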

    Mode of data collection

    Face-to-face [f2f]

    Cleaning operations

    All the questionnaires were checked when received. Where information was incomplete or appeared contradictory, the questionnaire was sent back to the relevant survey organization. As soon as the data was available, it was captured using the local development platform ADE. This was completed in February 1994. Following this, a series of exploratory programs was written to highlight inconsistencies and outliers. For example, all person-level files were linked together to ensure that the same person code reported in different sections of the questionnaire corresponded to the same person. The error reports from these programs were compared to the questionnaires and the necessary alterations made. This was a lengthy process, as several files were checked more than once, and it was completed at the beginning of August 1994. In some cases, questionnaires would contain missing values, or comments that the respondent did not know, or refused to answer, a question.

    These responses are coded in the data files with the following values:

    VALUE   MEANING
    -1      The data was not available on the questionnaire or form
    -2      The field is not applicable
    -3      Respondent refused to answer
    -4      Respondent did not know answer to question
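
When analysing the files, these codes usually need to be separated from real values before computing any statistics. A minimal sketch, assuming a variable that cannot legitimately take the values -1 to -4 (the helper name `recode_missing` and the sample values are hypothetical):

```python
from collections import Counter

# Meanings of the special codes, as listed in the survey documentation.
MISSING_CODES = {
    -1: "not available on the questionnaire or form",
    -2: "not applicable",
    -3: "respondent refused to answer",
    -4: "respondent did not know the answer",
}


def recode_missing(values):
    """Replace the special codes with None and tally why values are missing.

    Assumes the variable has no legitimate values between -4 and -1.
    """
    cleaned = []
    reasons = Counter()
    for value in values:
        if value in MISSING_CODES:
            reasons[value] += 1
            cleaned.append(None)
        else:
            cleaned.append(value)
    return cleaned, reasons


cleaned, reasons = recode_missing([5, -1, 12, -4, -4])
for code, count in sorted(reasons.items(), reverse=True):
    print(code, MISSING_CODES[code], count)
```

Keeping the tally by code preserves the distinction between refusals, "don't know" answers, and inapplicable fields, which can matter when assessing why data is missing.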

    Data appraisal

    The data collected in clusters 217 and 218 is considered highly unreliable and has therefore been removed from the data set. The data currently available on the web site has been revised to remove the data from these clusters. Researchers who downloaded the data in the past should revise their data sets accordingly. For information on the data in those clusters, contact SALDRU (http://www.saldru.uct.ac.za/).
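
For an older download that still contains these clusters, the revision amounts to filtering them out. A rough sketch, assuming each record is a dictionary with a `cluster` field (the field name and the sample records are hypothetical; check the actual variable name in the data files):

```python
# Clusters flagged as unreliable in the data appraisal note.
UNRELIABLE_CLUSTERS = {217, 218}


def drop_unreliable(records, cluster_field="cluster"):
    """Return only the records that are not in the flagged clusters."""
    return [r for r in records if r[cluster_field] not in UNRELIABLE_CLUSTERS]


sample = [
    {"cluster": 216, "hhid": 1},
    {"cluster": 217, "hhid": 2},
    {"cluster": 218, "hhid": 3},
    {"cluster": 219, "hhid": 4},
]
kept = drop_unreliable(sample)
print([r["cluster"] for r in kept])
```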

  20. Initial data analysis checklist for data screening in longitudinal studies.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Initial data analysis checklist for data screening in longitudinal studies. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Initial data analysis checklist for data screening in longitudinal studies.

