The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
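The two-stage design described above can be sketched in Python. This is only an illustration and not the R script distributed with the dataset: the toy sampling frame, the function names, the largest-remainder rounding, the use of household counts as the stratum size measure, and the simple random selection of enumeration areas within each stratum are all assumptions made for the example.

```python
import random

HOUSEHOLDS_PER_EA = 25   # fixed take per enumeration area (from the description)
SAMPLE_SIZE = 8000       # total households to select (from the description)


def allocate_eas(stratum_sizes, n_eas):
    """Split n_eas across strata proportionally to size (largest-remainder rounding)."""
    total = sum(stratum_sizes.values())
    quotas = {s: n_eas * size / total for s, size in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    leftover = n_eas - sum(alloc.values())
    # Hand the remaining EAs to the strata with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


def draw_sample(frame, sample_size=SAMPLE_SIZE, per_ea=HOUSEHOLDS_PER_EA, seed=0):
    """frame maps stratum -> {ea_id: [household_id, ...]}; returns selected household ids."""
    rng = random.Random(seed)
    n_eas = sample_size // per_ea  # 8000 / 25 = 320 enumeration areas
    # Assumption: stratum size is measured as the number of households.
    sizes = {s: sum(len(hh) for hh in eas.values()) for s, eas in frame.items()}
    alloc = allocate_eas(sizes, n_eas)
    sample = []
    for stratum, eas in frame.items():
        # Stage 1 (assumed simple random): pick the allocated number of EAs.
        for ea in rng.sample(sorted(eas), alloc[stratum]):
            # Stage 2: pick a fixed number of households within the EA.
            sample.extend(rng.sample(eas[ea], per_ea))
    return sample


def toy_frame():
    """Hypothetical frame: 4 strata (province x urban/rural), 40 households per EA."""
    ea_counts = {("prov1", "urban"): 200, ("prov1", "rural"): 150,
                 ("prov2", "urban"): 100, ("prov2", "rural"): 50}
    return {
        (prov, area): {
            f"{prov}-{area}-ea{i}": [f"{prov}-{area}-ea{i}-hh{j}" for j in range(40)]
            for i in range(n)
        }
        for (prov, area), n in ea_counts.items()
    }


if __name__ == "__main__":
    selected = draw_sample(toy_frame())
    print(len(selected), "households selected")
```

With a fixed take of 25 households, the 8,000-household target implies 320 enumeration areas in total, which is the quantity the first stage distributes across strata.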
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset presents detailed energy consumption records from various households over the month. With 90,000 rows and multiple features such as temperature, household size, air conditioning usage, and peak hour consumption, this dataset is perfect for performing time-series analysis, machine learning, and sustainability research.
Column Name | Data Type Category | Description |
---|---|---|
Household_ID | Categorical (Nominal) | Unique identifier for each household |
Date | Datetime | The date of the energy usage record |
Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh |
Household_Size | Numerical (Discrete) | Number of individuals living in the household |
Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius |
Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No) |
Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh |
Library | Purpose |
---|---|
pandas | Reading, cleaning, and transforming tabular data |
numpy | Numerical operations, working with arrays |
Library | Purpose |
---|---|
matplotlib | Creating static plots (line, bar, histograms, etc.) |
seaborn | Statistical visualizations, heatmaps, boxplots, etc. |
plotly | Interactive charts (time series, pie, bar, scatter, etc.) |
Library | Purpose |
---|---|
scikit-learn | Preprocessing, regression, classification, clustering |
xgboost / lightgbm | Gradient boosting models for better accuracy |
Library | Purpose |
---|---|
sklearn.preprocessing | Encoding categorical features, scaling, normalization |
datetime / pandas | Date-time conversion and manipulation |
Library | Purpose |
---|---|
sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc. |
✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.
This dataset is ideal for a wide variety of analytics and machine learning projects:
https://academictorrents.com/nolicensespecified
SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.
Overview
- 10 classes, 1 for each digit. Digit 1 has label 1, 9 has label 9 and 0 has label 10.
- 73257 digits for training, 26032 digits for testing, and 531131 additional, somewhat less difficult samples, to use as extra training data
- Comes in two formats:
  1. Original images with character level bounding boxes.
  2. MNIST-like 32-by-32 images centered around a single character (many of the images do contain some distractors at the sides). These are the origina
The Annual Population Survey (APS) household datasets are produced annually and are available from 2004 (Special Licence) and 2006 (End User Licence). They allow production of family and household labour market statistics at local areas and for small sub-groups of the population across the UK. The household data comprise key variables from the Labour Force Survey (LFS) and the APS 'person' datasets. The APS household datasets include all the variables on the LFS and APS person datasets, except for the income variables. They also include key family and household-level derived variables. These variables allow for an analysis of the combined economic activity status of the family or household. In addition, they also include more detailed geographical, industry, occupation, health and age variables.
For further detailed information about methodology, users should consult the Labour Force Survey User Guide, included with the APS documentation. For variable and value labelling and coding frames that are not included either in the data or in the current APS documentation, users are advised to consult the latest versions of the LFS User Guides, which are available from the ONS Labour Force Survey - User Guidance webpages.
Occupation data for 2021 and 2022
The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. None of ONS' headline statistics, other than those directly sourced from occupational data, are affected and you can continue to rely on their accuracy. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022
End User Licence and Secure Access APS data
Users should note that there are two versions of each APS dataset. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. The EUL version includes Government Office Region geography, banded age, 3-digit SOC and industry sector for main, second and last job. The Secure Access version contains more detailed variables relating to:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Sweet Home by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Sweet Home. The dataset can be used to understand the population distribution of Sweet Home by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Sweet Home. Additionally, it can be used to see how the gender ratio changes from birth to the most senior age group, and to compare the male-to-female ratio across age groups for Sweet Home.
Key observations
Largest age group (population): Male # 25-29 years (472) | Female # 70-74 years (462). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Scope of gender :
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports or presentations, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Sweet Home Population by Gender.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
By the middle of the 1990s, Indonesia had enjoyed over three decades of remarkable social, economic, and demographic change and was on the cusp of joining the middle-income countries. Per capita income had risen more than fifteenfold since the early 1960s, from around US$50 to more than US$800. Increases in educational attainment and decreases in fertility and infant mortality over the same period reflected impressive investments in infrastructure. In the late 1990s the economic outlook began to change as Indonesia was gripped by the economic crisis that affected much of Asia. In 1998 the rupiah collapsed, the economy went into a tailspin, and gross domestic product contracted by an estimated 12-15%, a decline rivaling the magnitude of the Great Depression. The general trend of several decades of economic progress followed by a few years of economic downturn masks considerable variation across the archipelago in the degree both of economic development and of economic setbacks related to the crisis. In part this heterogeneity reflects the great cultural and ethnic diversity of Indonesia, which in turn makes it a rich laboratory for research on a number of individual- and household-level behaviors and outcomes that interest social scientists. The Indonesia Family Life Survey is designed to provide data for studying these behaviors and outcomes. The survey contains a wealth of information collected at the individual and household levels, including multiple indicators of economic and non-economic well-being: consumption, income, assets, education, migration, labor market outcomes, marriage, fertility, contraceptive use, health status, use of health care and health insurance, relationships among co-resident and non-resident family members, processes underlying household decision-making, transfers among family members and participation in community activities.
In addition to individual- and household-level information, the IFLS provides detailed information from the communities in which IFLS households are located and from the facilities that serve residents of those communities. These data cover aspects of the physical and social environment, infrastructure, employment opportunities, food prices, access to health and educational facilities, and the quality and prices of services available at those facilities. By linking data from IFLS households to data from their communities, users can address many important questions regarding the impact of policies on the lives of the respondents, as well as document the effects of social, economic, and environmental change on the population. The Indonesia Family Life Survey complements and extends the existing survey data available for Indonesia, and for developing countries in general, in a number of ways. First, relatively few large-scale longitudinal surveys are available for developing countries. IFLS is the only large-scale longitudinal survey available for Indonesia. Because data are available for the same individuals from multiple points in time, IFLS affords an opportunity to understand the dynamics of behavior, at the individual, household and family and community levels. In IFLS1 7,224 households were interviewed, and detailed individual-level data were collected from over 22,000 individuals. In IFLS2, 94.4% of IFLS1 households were re-contacted (interviewed or died). In IFLS3 the re-contact rate was 95.3% of IFLS1 households. Indeed nearly 91% of IFLS1 households are complete panel households in that they were interviewed in all three waves, IFLS1, 2 and 3. These re-contact rates are as high as or higher than most longitudinal surveys in the United States and Europe. High re-interview rates were obtained in part because we were committed to tracking and interviewing individuals who had moved or split off from the origin IFLS1 households. 
High re-interview rates contribute significantly to data quality in a longitudinal survey because they lessen the risk of bias due to nonrandom attrition in studies using the data. Second, the multipurpose nature of IFLS instruments means that the data support analyses of interrelated issues not possible with single-purpose surveys. For example, the availability of data on household consumption together with detailed individual data on labor market outcomes, health outcomes and on health program availability and quality at the community level means that one can examine the impact of income on health outcomes, but also whether health in turn affects incomes. Third, IFLS collected both current and retrospective information on most topics. With data from multiple points of time on current status and an extensive array of retrospective information about the lives of respondents, analysts can relate dynamics to events that occurred in the past. For example, changes in labor outcomes in recent years can be explored as a function of earlier decisions about schooling and work. Fourth, IFLS collected extensive measures of health status, including self-reported measures of general health status, morbidity experience, and physical assessments conducted by a nurse (height, weight, head circumference, blood pressure, pulse, waist and hip circumference, hemoglobin level, lung capacity, and time required to repeatedly rise from a sitting position). These data provide a much richer picture of health status than is typically available in household surveys. For example, the data can be used to explore relationships between socioeconomic status and an array of health outcomes. Fifth, in all waves of the survey, detailed data were collected about respondents' communities and public and private facilities available for their health care and schooling.
The facility data can be combined with household and individual data to examine the relationship between, for example, access to health services (or changes in access) and various aspects of health care use and health status. Sixth, because the waves of IFLS span the period from several years before the economic crisis hit Indonesia, to just prior to it hitting, to one year and then three years after, extensive research can be carried out regarding the living conditions of Indonesian households during this very tumultuous period. In sum, the breadth and depth of the longitudinal information on individuals, households, communities, and facilities make IFLS data a unique resource for scholars and policymakers interested in the processes of economic development.
To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey, which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.
The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.
Two harmonized datafiles are prepared for each survey. The two datafiles are: 1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractors, and crop sales. 2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.
National coverage
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
See “Ethiopia - Socioeconomic Survey 2018-2019” and “Ethiopia - COVID-19 High Frequency Phone Survey of Households 2020” available in the Microdata Library for details.
Computer Assisted Personal Interview [capi]
Ethiopia Socioeconomic Survey (ESS) 2018-2019 and Ethiopia COVID-19 High Frequency Phone Survey of Households (HFPS) 2020 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).
The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round of the survey is noted with “_rX” in the variable name, where X represents the number of the round. For example, a variable with “_r3” was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey, which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without “_rX” were extracted from Round 0.
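The round-suffix convention above can be decoded with a small helper. The sketch below is illustrative only: the function name and the example variable names are hypothetical, and it assumes the “_rX” marker sits at the very end of the variable name, which the description does not state explicitly.

```python
import re

# Assumption: the round marker is a trailing "_r" followed by digits.
_ROUND_SUFFIX = re.compile(r"_r(\d+)$")


def split_round(variable_name):
    """Return (base_name, round_number); names without a suffix map to Round 0."""
    match = _ROUND_SUFFIX.search(variable_name)
    if match is None:
        return variable_name, 0
    return variable_name[:match.start()], int(match.group(1))


if __name__ == "__main__":
    # Hypothetical variable names, for illustration only.
    for name in ["food_insecure_r3", "hhsize"]:
        print(name, "->", split_round(name))
```

Grouping columns by the returned round number is one way to separate baseline (face-to-face) variables from those collected in a given phone round.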
See “Ethiopia - Socioeconomic Survey 2018-2019” and “Ethiopia - COVID-19 High Frequency Phone Survey of Households 2020” available in the Microdata Library for details.
This indicator provides an estimate of the average number of inhabitants per household per hectare. For each hectare cell, the number of inhabitants is divided by the number of inhabited address points (as a proxy for households) in the hectare cell. This indicator uses the same source data as the indicators 'Inhabitant density per ha' and 'Household density per ha'. The population dataset describes the location and the number of inhabitants per address on the basis of a point file. This file is created by Digital Flanders (DV) based on data from the National Register. The location of the addresses is determined by a geocoding of the addresses based on CRAB (CRAB population numbers version 0.1). The number of records in the dataset (i.e. number of addresses) of which the number of inhabitants is not equal to '0' or 'NoData (-99)' is used as a proxy for the number of households. Addresses for which the number of inhabitants is equal to 0 are addresses from the CRAB database that do not appear in the National Register and where no one lives. Addresses with a 'NoData' value are addresses for which the number of inhabitants is not exactly known. This is the case, for example, for subaddresses whose population in the National Register is only known at house number level and not per subaddress. The total number of households in Flanders according to this assumption is lower than the total number of households reported by the FPS Economy. The dataset is corrected for this difference at the level of the municipalities on the basis of the statistics of the number of households per municipality. To this end, the number of households per municipality is first summed from the point file. Next, the ratio between this sum and the number of households reported by Statbel is calculated. This factor is then applied to all address points.
For example, if 5% fewer households are calculated in a certain municipality on the basis of the point file compared to the statistics, the number of households is increased by about 5% at each point location. For more details about the development of this product and the accompanying figures, see the technical report 'Indicators Spatial Efficiency, state and evolution 2013-2019 - technical data sheets' at https://archief-algemeen.milieu.vlaanderen.be/xmlui/handle/acd/762878
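The calculation described above can be sketched as follows. This is a minimal illustration and not the procedure from the technical report: the function names and the toy figures are hypothetical, and the sketch assumes the correction factor is the reported household count divided by the number of inhabited address points, so that each inhabited address counts as `factor` households.

```python
NODATA = -99  # value used for addresses whose number of inhabitants is unknown


def correction_factor(municipality_addresses, reported_households):
    """Ratio of reported households to inhabited address points in a municipality."""
    counted = sum(1 for n in municipality_addresses if n not in (0, NODATA))
    return reported_households / counted


def inhabitants_per_household(cell_addresses, factor):
    """Average inhabitants per (corrected) household within one hectare cell."""
    inhabited = [n for n in cell_addresses if n not in (0, NODATA)]
    if not inhabited:
        return None  # no inhabited address points in this cell
    households = len(inhabited) * factor  # corrected household count
    return sum(inhabited) / households


if __name__ == "__main__":
    # Toy municipality: inhabitants per address point (hypothetical numbers).
    municipality = [3, 2, 0, NODATA, 4, 1]
    factor = correction_factor(municipality, reported_households=5)
    print(factor, inhabitants_per_household([3, 2, 0], factor))
```

Under this reading, a municipality whose point file yields 5% fewer households than the statistics gets a factor of 1/0.95, slightly more than 1.05.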
The General Household Survey (GHS) is a continuous national survey of people living in private households conducted on an annual basis, by the Social Survey Division of the Office for National Statistics (ONS). The main aim of the survey is to collect data on a range of core topics, covering household, family and individual information. This information is used by government departments and other organisations for planning, policy and monitoring purposes, and to present a picture of households, family and people in Great Britain. From 2008, the General Household Survey became a module of the Integrated Household Survey (IHS). In recognition, the survey was renamed the General Lifestyle Survey (GLF/GLS). The GHS started in 1971 and has been carried out continuously since then, except for breaks in 1997-1998 when the survey was reviewed, and 1999-2000 when the survey was redeveloped. Following the 1997 review, the survey was relaunched from April 2000 with a different design. The relevant development work and the changes made are fully described in the Living in Britain report for the 2000-2001 survey. Following its review, the GHS was changed to comprise two elements: the continuous survey and extra modules, or 'trailers'. The continuous survey remained unchanged from 2000 to 2004, apart from essential adjustments to take account of, for example, changes in benefits and pensions. The GHS retained its modular structure and this allowed a number of different trailers to be included for each of those years, to a plan agreed by sponsoring government departments.
Further changes to the GHS methodology from 2005
From April 1994 to 2005, the GHS was conducted on a financial year basis, with fieldwork spread evenly from April of one year to March the following year. However, in 2005 the survey period reverted to a calendar year and the whole of the annual sample was surveyed in the nine months from April to December 2005.
Future surveys will run from January to December each year, hence the title date change to single year from 2005 onwards. Since the 2005 GHS (held under SN 5640) does not cover the January-March quarter, this affects annual estimates for topics which are subject to seasonal variation. To rectify this, where the questions were the same in 2005 as in 2004-2005, the final quarter of the latter survey was added (weighted in the correct proportion) to the nine months of the 2005 survey. Furthermore, in 2005, the European Union (EU) made a legal obligation (EU-SILC) for member states to collect additional statistics on income and living conditions. In addition to this the EU-SILC data cover poverty and social exclusion. These statistics are used to help plan and monitor European social policy by comparing poverty indicators and changes over time across the EU. The EU-SILC requirement has been integrated into the GHS, leading to large-scale changes in the 2005 survey questionnaire. The trailers on 'Views of your Local Area' and 'Dental Health' have been removed. Other changes have been made to many of the standard questionnaire sections, details of which may be found in the GHS 2005 documentation.
Further changes to the GLF/GHS methodology from 2008
As noted above, the General Household Survey (GHS) was renamed the General Lifestyle Survey (GLF/GLS) in 2008. The sample design of the GLF/GLS is the same as the GHS before, and the questionnaire remains largely the same. The main change is that the GLF now includes the IHS core questions, which are common to all of the separate modules that together comprise the IHS. Some of these core questions are simply questions that were previously asked in the same or a similar format on all of the IHS component surveys (including the GLF/GLS). The core questions cover employment, smoking prevalence, general health, ethnicity, citizenship and national identity.
These questions are asked by proxy if an interview is not possible with the selected respondent (that is, a member of the household can answer on behalf of other respondents in the household). This is a departure from the GHS, which did not ask smoking prevalence and general health questions by proxy, whereas the GLF/GLS does from 2008. For details on other changes to the GLF/GLS questionnaire, please see the GLF/GLS 2008: Special Licence Access documentation held with SN 6414. Currently, the UK Data Archive holds only the SL (and not the EUL) version of the GLF/GLS for 2008.
Changes to the drinking section
There have been a number of revisions to the methodology that is used to produce the alcohol consumption estimates. In 2006, the average number of units assigned to the different drink types and the assumption around the average size of a wine glass were updated, resulting in significantly increased consumption estimates. In addition to the revised method, a new question about wine glass size was included in the survey in 2008. Respondents were asked whether they have consumed small (125 ml), standard (175 ml) or large (250 ml) glasses of wine. The data from this question are used when calculating the number of units of alcohol consumed by the respondent. It is assumed that a small glass contains 1.5 units, a standard glass contains 2 units and a large glass contains 3 units. (In 2006 and 2007 it was assumed that all respondents drank from a standard 175 ml glass containing 2 units.) The datasets contain the original set of variables based on the original methodology, as well as those based on the revised and (for 2008 onwards) updated methodologies. Further details on these changes are provided in the Guidelines documents held in SN 5804 - GHS 2006; and SN 6414 - GLF/GLS 2008: Special Licence Access.
Special Licence GHS/GLF/GLS
Special Licence (SL) versions of the GHS/GLF/GLS are available from 1998-1999 onwards.
The SL versions include all variables held in the standard 'End User Licence' (EUL) version, plus extra variables covering cigarette codes and descriptions, and some birthdate information for respondents and household members. Prospective SL users will need to complete an extra application form and demonstrate to the data owners exactly why they need access to the extra variables, in order to get permission to use the SL version. Therefore, most users should order the EUL version of the data. In order to help users choose the correct dataset, 'Special Licence Access' has been added to the dataset titles for the SL versions of the data. A list of all GHS/GLF/GLS studies available from the UK Data Archive may be found on the GHS/GLF/GLS major studies web page. See below for details of SL datasets for the corresponding GHS/GLF/GLS year (1998-1999 onwards only).
UK Data Archive data holdings and formats
The UK Data Archive GHS/GLF/GLS holdings begin with the 1971 study for EUL data, and from 1998-1999 for SL versions (see above). Users should note that data for the 1971 study are currently only available as ASCII files without accompanying SPSS set-up files. SPSS files for the 1972 study were created by John Simister, and redeposited at the Archive in 2000. Currently, the UK Data Archive holds only the SL versions of the GHS/GLF/GLS for 2007 and 2008.
Reformatted Data 1973 to 1982 - Surrey SPSS Files
SPSS files have been created by the University of Surrey for all study years from 1973 to 1982 inclusive. These early files were restructured and the case changed from the household to the individual with all of the household information duplicated for each individual. The Surrey SPSS files contain all the original variables as well as some extra derived variables (a few variables were omitted from the data files for 1973-76). In 1973 only, the section on leisure was not included in the Surrey SPSS files.
This has subsequently been made available, however, and is now held in a separate study, General Household Survey, 1973: Leisure Questions (held under SN 3982). Records for the original GHS 1973-1982 ASCII files have been removed from the UK Data Archive catalogue, but the data are still preserved and available upon request. Users should note that GHS/GLF/GLS data are also available in formats other than SPSS.
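The wine-glass assumptions given in the drinking section above (1.5, 2 and 3 units for small, standard and large glasses, with every glass treated as standard in 2006 and 2007) can be illustrated with a short sketch. The function name, its parameters and the example counts are hypothetical; this is not the derivation used to build the GHS/GLF variables, which is documented in the Guidelines held with the studies.

```python
# Units of alcohol assumed per glass of wine, as stated in the description.
UNITS_PER_GLASS = {"small": 1.5, "standard": 2.0, "large": 3.0}


def wine_units(glass_counts, glass_size_known=True):
    """Total wine units for a respondent.

    glass_counts maps glass size -> number of glasses drunk.
    glass_size_known=False mimics the 2006-2007 assumption that every
    glass is a standard 175 ml glass containing 2 units.
    """
    if not glass_size_known:
        return sum(glass_counts.values()) * UNITS_PER_GLASS["standard"]
    return sum(UNITS_PER_GLASS[size] * n for size, n in glass_counts.items())


if __name__ == "__main__":
    counts = {"small": 2, "large": 2}  # hypothetical respondent
    print(wine_units(counts), wine_units(counts, glass_size_known=False))
```

The two calls show how the same reported glasses can yield different unit totals under the 2008 glass-size question and under the earlier standard-glass assumption.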
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Red House town by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for Red House town. The dataset can be used to understand the population distribution of Red House town by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in Red House town. Additionally, it can be used to see how the gender ratio changes from birth to the most senior age group, and to compare the male-to-female ratio across age groups for Red House town.
Key observations
Largest age group (population): Male # 40-44 years (5) | Female # 40-44 years (5). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Age groups:
Scope of gender :
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Red House town Population by Gender. You can refer the same here
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de442616
Abstract (en): The Public Use Microdata Samples (PUMS) contain person- and household-level information from the "long-form" questionnaires distributed to a sample of the population enumerated in the 1980 Census. This data collection, containing 5-percent data, identifies every state, county groups, and most individual counties with 100,000 or more inhabitants (350 in all). In many cases, individual cities or groups of places with 100,000 or more inhabitants are also identified. Household-level variables include housing tenure, year structure was built, number and types of rooms in dwelling, plumbing facilities, heating equipment, taxes and mortgage costs, number of children, and household and family income. The person record contains demographic items such as sex, age, marital status, race, Spanish origin, income, occupation, transportation to work, and education. All persons and housing units in the United States and Puerto Rico. For this data collection, the full 1980 Census sample that received the "long-form" questionnaire (19.4 percent of all households) was sampled again through a stratified systematic selection procedure with probability proportional to a measure of size. This 5-percent sample, i.e., 5 households for every 100 households in the nation, includes over one-fourth of the households that received the long-form questionnaire. 
2006-01-12 All files were removed from dataset 81 and flagged as study-level files, so that they will accompany all downloads.
2006-01-12 All files were removed from dataset 80 and flagged as study-level files, so that they will accompany all downloads.
1997-08-25 Part 72, Puerto Rico data, has been added to the collection, as well as supplemental documentation for Puerto Rico in the form of a separate PDF file.
The household and person records in each hierarchical data file have logical record lengths of 193 characters, but the number of records varies with each file. The record layout for Part 72, Puerto Rico, is different from the state datasets. Refer to the supplemental documentation for this part. The codebook is available in hardcopy form only, while the Puerto Rico supplemental documentation is provided as a Portable Document Format (PDF) file.
This map indicator gives an estimate of the number of households per hectare, corrected per municipality. For this indicator, the same basic data are used as for the indicator 'Inhabitant density per ha'. The population dataset describes the location and the number of inhabitants per address on the basis of a point file. This file is created by Digital Flanders (DV) based on data from the National Register. The location of the addresses is determined by a geocoding of the addresses based on CRAB (CRAB population numbers version 0.1). The number of records in the dataset (i.e. number of addresses) of which the number of inhabitants is not equal to '0' or 'NoData (-99)' is used as a proxy for the number of households. Addresses for which the number of inhabitants is equal to 0 are addresses from the CRAB database that do not appear in the National Register and where no one lives. Addresses with a 'NoData' value are addresses for which the number of inhabitants is not exactly known. This is the case, for example, for subaddresses whose population in the National Register is only known at house number level and not per subaddress. The total number of households in Flanders according to this assumption is lower than the total number of households reported by the FPS Economy. The dataset is corrected for this difference at the level of the municipalities on the basis of the statistics of the number of households per municipality. To this end, the sum of the number of households per municipality is first made on the basis of the points file. Subsequently, the factor of this sum is calculated in relation to the reported number of households by Statbel. This factor is then applied to all address points. If, for example, 5% fewer households are calculated in a certain municipality on the basis of the points file compared to the StatBel statistics, the number of households is increased by 5% at each point location. 
This corrected point file is then scaled up to a resolution of 1 ha by summing the corrected number of households for each hectare cell. For more details about the creation of this product and the accompanying figures, see the technical report 'Indicators Spatial Efficiency, condition and evolution 2013-2019 - technical data sheets', which you can find at https://archief-algemeen.milieu.vlaanderen.be/xmlui/handle/acd/762878
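The correction and aggregation steps described above can be sketched as follows. This is a minimal illustration, not the producer's actual workflow: the record layout (`municipality`, `x`, `y`, `inhabitants`), the function name, and the use of 100 m grid cells derived from metric coordinates are all assumptions. Under this reading the factor is the reported count divided by the counted addresses, so a 5% shortfall corresponds to a factor of 1/0.95, roughly 1.053.

```python
from collections import defaultdict

NO_DATA = -99  # 'NoData' code mentioned in the description


def households_per_hectare(points, reported_households):
    """points: list of dicts with hypothetical keys
    'municipality', 'x', 'y' (metres) and 'inhabitants'.
    reported_households: dict municipality -> official household count."""
    # An address is a household proxy unless inhabitants is 0 or NoData.
    occupied = [p for p in points if p["inhabitants"] not in (0, NO_DATA)]

    # Sum the proxy households per municipality from the point file.
    counted = defaultdict(int)
    for p in occupied:
        counted[p["municipality"]] += 1

    # Correction factor: reported households relative to counted addresses.
    factor = {m: reported_households[m] / n for m, n in counted.items()}

    # Apply the factor to each point and sum per 100 m x 100 m (1 ha) cell.
    cells = defaultdict(float)
    for p in occupied:
        cell = (int(p["x"] // 100), int(p["y"] // 100))
        cells[cell] += factor[p["municipality"]]
    return dict(cells)
```

Each occupied address contributes its municipality's factor rather than 1, so the cell totals add up to the reported municipal household count.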
Household income is a potential predictor for a number of environmental influences, for example, application of urban pesticides. This product is a U.S. conterminous mapping of block group income derived from the 2010-2014 Census American Community Survey (ACS), adjusted by a 2013 county-level Cost-of-Living index obtained from the Council for Community and Economic Research. The resultant raster is provided at 200-m spatial resolution, in units of adjusted household income in thousands of dollars per year.
To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey, which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.
The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.
Two harmonized datafiles are prepared for each survey. The two datafiles are:
1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractors and crop sales.
2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.
National coverage
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
See “Nigeria - General Household Survey, Panel 2018-2019, Wave 4” and “Nigeria - COVID-19 National Longitudinal Phone Survey 2020” available in the Microdata Library for details.
Computer Assisted Personal Interview [capi]
Nigeria General Household Survey, Panel (GHS-Panel) 2018-2019 and Nigeria COVID-19 National Longitudinal Phone Survey (COVID-19 NLPS) 2020 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).
The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round of the survey is noted with “_rX” in the variable name, where X represents the number of the round. For example, “_r3” in a variable name indicates that the variable was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey, which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without “_rX” were extracted from Round 0.
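As an illustration of the naming convention above, the following sketch recovers the originating round from a variable name. It assumes the “_rX” marker appears as a suffix at the end of the name; the function name and the example variable names are hypothetical and not taken from the datafiles.

```python
import re

# Assumption: the round marker is a trailing suffix such as "_r3".
_ROUND_SUFFIX = re.compile(r"_r(\d+)$")


def survey_round(variable_name):
    """Return the round a variable was extracted from.

    Variables without an "_rX" suffix are treated as Round 0,
    the face-to-face survey used as the sample frame.
    """
    match = _ROUND_SUFFIX.search(variable_name)
    return int(match.group(1)) if match else 0
```

For example, a hypothetical variable `food_insecure_r3` would map to Round 3, while `hhsize` would map to Round 0.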
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Survey based Harmonized Indicators (SHIP) files are harmonized data files from household surveys that are conducted by countries in Africa. To ensure the quality and transparency of the data, it is critical to document the procedures of compiling consumption aggregation and other indicators so that the results can be duplicated with ease. This process enables consistency and continuity that make temporal and cross-country comparisons consistent and more reliable.
Four harmonized data files are prepared for each survey to generate a set of harmonized variables that have the same variable names. Invariably, in each survey, questions are asked in a slightly different way, which makes it challenging to define harmonized variables consistently. The harmonized household survey data present the best available variables with harmonized definitions, but not identical variables. The four harmonized data files are:
a) Individual level file (labor force indicators are in a separate file): This file has information on basic characteristics of individuals such as age and sex, literacy, education, health, anthropometry and child survival.
b) Labor force file: This file has information on the labor force, including employment/unemployment, earnings, sectors of employment, etc.
c) Household level file: This file has information on household expenditure, household head characteristics (age and sex, level of education, employment), housing amenities, assets, and access to infrastructure and services.
d) Household Expenditure file: This file has consumption/expenditure aggregates by consumption groups according to the UN Classification of Individual Consumption According to Purpose (COICOP).
National
The survey covered all de jure household members (usual residents).
Sample survey data [ssd]
Sample Frame
The list of households obtained from the 2001/2 Ethiopian Agricultural Sample Enumeration (EASE) was used as a frame to select EAs from the rural part of the country. The list of households by EA obtained from the 2004 Ethiopian Urban Economic Establishment Census (EUEEC) was used as a frame to select sample enumeration areas for the urban HICE survey. A fresh list of households from each urban and rural EA was prepared at the beginning of the survey period. This list was then used as a frame to select households from sample EAs.
Sample Design
For the purpose of the survey, the country was divided into three broad categories: rural, major urban centers, and other urban centers.
Category I: Rural. This category consists of the rural areas of eight regional states and two administrative councils (Addis Ababa and Dire Dawa) of the country, except Gambella region. Each region was considered to be a domain (reporting level) for which major findings of the survey are reported. This category comprises 10 reporting levels. A stratified two-stage cluster sample design was used to select samples, in which the primary sampling units (PSUs) were EAs. Twelve households per sample EA were selected as second-stage sampling units (SSUs), to which the survey questionnaire was administered.
Category II: Major urban centers. In this category all regional capitals (except Gambella region) and four additional urban centers with larger populations than other urban centers were included. Each urban center in this category was considered a reporting level. However, each sub-city of Addis Ababa was considered to be a domain (reporting level). Since there is high variation in the standards of living of the residents of these urban centers (which may have a significant impact on the final results of the survey), each urban center was further stratified into the following three sub-strata:
Sub-stratum 1: Households having a relatively high standard of living
Sub-stratum 2: Households having a relatively medium standard of living
Sub-stratum 3: Households having a relatively low standard of living
The category has a total of 14 reporting levels. A stratified two-stage cluster sample design was also adopted in this instance. The primary sampling units were the EAs of each urban center. Sample EAs of a reporting level were allocated among the above-mentioned strata in proportion to the number of EAs in each stratum. Sixteen households from each sample EA were finally selected as second-stage sampling units (SSUs).
Category III: Other urban centers. Urban centers in the country other than those under Category II were grouped into this category. Excluding Gambella region, a domain of "other urban centers" is formed for each region. Consequently, 7 reporting levels were formed in this category. Harari, Addis Ababa and Dire Dawa do not have urban centers other than those grouped in Category II. Hence, no domain was formed for these regions under this category. Unlike the above two categories, a stratified three-stage cluster sample design was adopted to select samples from this category. The primary sampling units were urban centers and the second-stage sampling units were EAs. Sixteen households from each EA were selected at the third stage, and the survey questionnaire was administered to all of them.
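The proportional allocation of sample EAs among the three sub-strata in Category II can be sketched as below. The source does not say how fractional allocations were rounded, so the largest-remainder rule used here is an assumption, as are the function name and the stratum labels.

```python
def allocate_proportional(ea_counts, n_sample_eas):
    """Allocate n_sample_eas among strata in proportion to their EA counts.

    ea_counts: dict stratum -> number of EAs in that stratum.
    Rounding uses the largest-remainder rule (an assumption; the
    survey documentation does not state the rounding method).
    """
    total = sum(ea_counts.values())
    quotas = {s: n_sample_eas * c / total for s, c in ea_counts.items()}
    allocation = {s: int(q) for s, q in quotas.items()}

    # Hand out the EAs lost to truncation, largest fractional part first.
    leftover = n_sample_eas - sum(allocation.values())
    by_remainder = sorted(
        quotas, key=lambda s: quotas[s] - allocation[s], reverse=True
    )
    for stratum in by_remainder[:leftover]:
        allocation[stratum] += 1
    return allocation
```

With 20, 50 and 30 EAs in the high, medium and low sub-strata and 10 sample EAs to allocate, this gives 2, 5 and 3 sample EAs respectively.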
Face-to-face [f2f]
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Dataset contains counts and measures for households from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.
The variables included in this dataset are for households in occupied private dwellings (unless otherwise stated). All data is for level 1 of the classification (unless otherwise stated):
Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.
Footnotes
Geographical boundaries
Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.
Caution using time series
Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).
About the 2023 Census dataset
For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.
Data quality
The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.
Concept descriptions and quality ratings
Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.
Household crowding
Household crowding is based on the Canadian National Occupancy Standard (CNOS). It calculates the number of bedrooms needed based on the demographic composition of the household. The household crowding index methodology for 2023 Census has been updated to use gender instead of sex. Household crowding should be used with caution for small geographical areas due to high volatility between census years as a result of population change and urban development. There may be additional volatility in areas affected by the cyclone, particularly in Gisborne and Hawke's Bay. Household crowding index – 2023 Census has details on how the methodology has changed, differences from 2018 Census, and more.
Using data for good
Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga”.
Confidentiality
The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
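To make the rounding rule concrete, here is a generic sketch of random rounding to base 3 that is "fixed" in the sense that the same cell always rounds the same way. It is not Stats NZ's implementation: the hash-based draw, the `cell_key` argument, and the 2/3 versus 1/3 probabilities (the usual formulation of random rounding to base 3) are assumptions, and the `sensitive` flag stands in for the table conditions under which suppression applies.

```python
import hashlib

CONFIDENTIAL = -999  # placeholder listed under 'Symbol'


def frr3(count, cell_key):
    """Round count to a multiple of 3, deterministically per cell."""
    remainder = count % 3
    if remainder == 0:
        return count  # multiples of 3 are left unchanged
    down = count - remainder
    up = down + 3
    nearest, other = (down, up) if remainder == 1 else (up, down)
    # Deterministic pseudo-random draw in {0, 1, 2} from the cell identity.
    digest = hashlib.sha256(f"{cell_key}:{count}".encode()).digest()
    draw = int.from_bytes(digest[:8], "big") % 3
    # Nearest multiple with probability about 2/3, the other about 1/3.
    return nearest if draw < 2 else other


def protect(count, cell_key, sensitive=False):
    """Suppress sensitive counts below six, otherwise apply rounding."""
    if sensitive and count < 6:
        return CONFIDENTIAL
    return frr3(count, cell_key)
```

Because each cell is rounded independently, rounded components need not add up to the rounded total, which is why individual figures may not always sum to stated totals.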
Measures
Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on fewer than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where cells have been suppressed, a placeholder value has been used.
Percentages
To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.
Symbol
-997 Not available
-999 Confidential
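When computing percentages from the data table, the placeholder codes above should not be treated as real counts. The following is a minimal sketch of the percentage rule that treats -997 and -999 as missing; the function name is illustrative.

```python
NOT_AVAILABLE = -997
CONFIDENTIAL = -999
_PLACEHOLDERS = (NOT_AVAILABLE, CONFIDENTIAL)


def percentage(category_count, total_stated):
    """Percentage of 'Total stated', or None when it cannot be computed."""
    if category_count in _PLACEHOLDERS or total_stated in _PLACEHOLDERS:
        return None  # suppressed or unavailable input
    if total_stated == 0:
        return None  # avoid division by zero for empty areas
    return 100.0 * category_count / total_stated
```

Because counts are randomly rounded, percentages computed this way may not sum to exactly 100 across categories.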
Inconsistencies in definitions
Please note that there may be differences in definitions between census classifications and those used for other data collections.
The study included four separate surveys:
1. The LSMS survey of general population of Serbia in 2002
2. The survey of Family Income Support (MOP in Serbian) recipients in 2002
These two datasets are published together.
3. The LSMS survey of general population of Serbia in 2003 (panel survey)
4. The survey of Roma from Roma settlements in 2003
These two datasets are published together, separately from the 2002 datasets.
Objectives
The LSMS is a multi-topic study of household living standards, based on international experience in designing and conducting this type of research. The basic survey was carried out in 2002 on a representative sample of households in Serbia (without Kosovo and Metohija). Its goal was to establish a poverty profile from comprehensive data on household welfare and to identify vulnerable groups. It also aimed to assess the targeting of safety net programs by collecting detailed information from individuals on participation in specific government social programs. This study was used as the basic document in developing the Poverty Reduction Strategy (PRS) in Serbia, which was adopted by the Government of the Republic of Serbia in October 2003.
The survey was repeated in 2003 on a panel sample (the households which participated in 2002 survey were re-interviewed).
Analysis of the take-up and profile of the population in 2003 was the first step towards formulating the system of monitoring in the Poverty Reduction Strategy (PRS). The survey was conducted in accordance with the same methodological principles used in 2002 survey, with necessary changes referring only to the content of certain modules and the reduction in sample size. The aim of the repeated survey was to obtain panel data to enable monitoring of the change in the living standard within a period of one year, thus indicating whether there had been a decrease or increase in poverty in Serbia in the course of 2003. [Note: Panel data are the data obtained on the sample of households which participated in the both surveys. These data made possible tracking of living standard of the same persons in the period of one year.]
Along with these two comprehensive surveys, conducted on nationally and regionally representative samples intended to give a picture of the general population, there were also two surveys with particular emphasis on vulnerable groups. In 2002, this was the survey of the living standard of Family Income Support recipients, which aimed to validate this state-supported social welfare program. In 2003 the survey of Roma from Roma settlements was conducted. Since all available experience indicated that this was one of the most vulnerable groups on the territory of Serbia and Montenegro, yet little research on poverty among the Roma population had been carried out, the aim of the survey was to compare the poverty of this group with that of the general population and to establish which categories of the Roma population were at the greatest risk of poverty in 2003. However, it is necessary to stress that the LSMS of the Roma population covered the potentially most vulnerable Roma, while Roma integrated into the main population were not included in this study.
The surveys were conducted on the whole territory of Serbia (without Kosovo and Metohija).
Sample survey data [ssd]
The sample frame for both surveys of the general population (LSMS) in 2002 and 2003 consisted of all permanent residents of Serbia, without the population of Kosovo and Metohija, according to the definition of permanently resident population contained in the UN Recommendations for Population Censuses, which were applied in the 2002 Census of Population in the Republic of Serbia. Therefore, permanent residents were all persons living in the territory of Serbia longer than one year, with the exception of diplomatic and consular staff.
The sample frame for the survey of Family Income Support recipients included all current recipients of this program on the territory of Serbia based on the official list of recipients given by Ministry of Social affairs.
The definition of the Roma population from Roma settlements was faced with obstacles since precise data on the total number of Roma population in Serbia are not available. According to the last population Census from 2002 there were 108,000 Roma citizens, but the data from the Census are thought to significantly underestimate the total number of the Roma population. However, since no other more precise data were available, this number was taken as the basis for estimate on Roma population from Roma settlements. According to the 2002 Census, settlements with at least 7% of the total population who declared itself as belonging to Roma nationality were selected. A total of 83% or 90,000 self-declared Roma lived in the settlements that were defined in this way and this number was taken as the sample frame for Roma from Roma settlements.
Planned sample: In 2002 the planned size of the sample of the general population was 6,500 households. The sample was both nationally and regionally representative (representative for each individual stratum). In 2003 the planned panel sample size was 3,000 households. In order to preserve the representative quality of the sample, we kept every other census block unit of the large sample realized in 2002. This way we kept the identical allocation by strata. In each selected census block unit, the same households were interviewed as in the basic survey in 2002. The planned sample of Family Income Support recipients in 2002 and of Roma from Roma settlements in 2003 was 500 households for each group.
Sample type: In both national surveys the implemented sample was a two-stage stratified sample. Units of the first stage were enumeration districts, and units of the second stage were the households. In the basic 2002 survey, enumeration districts were selected with probability proportional to number of households, so that the enumeration districts with bigger number of households have a higher probability of selection. In the repeated survey in 2003, first-stage units (census block units) were selected from the basic sample obtained in 2002 by including only even numbered census block units. In practice this meant that every second census block unit from the previous survey was included in the sample. In each selected enumeration district the same households interviewed in the previous round were included and interviewed. On finishing the survey in 2003 the cases were merged both on the level of households and members.
Stratification: Municipalities are stratified into the following six territorial strata: Vojvodina, Belgrade, Western Serbia, Central Serbia (Šumadija and Pomoravlje), Eastern Serbia and South-east Serbia. Primary units of selection are further stratified into enumeration districts which belong to urban type of settlements and enumeration districts which belong to rural type of settlement.
The sample of Family Income Support recipients consisted of cases chosen randomly from the official list of recipients provided by the Ministry of Social Affairs. The sample of Roma from Roma settlements was, as in the national survey, a two-stage stratified sample, but the first-stage units were settlements where the Roma population accounted for over 7% of residents, and the second-stage units were Roma households. Settlements are stratified into three territorial strata: Vojvodina, Belgrade and Central Serbia.
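The first-stage selection for the national surveys can be illustrated with the sketch below. The source states only that enumeration districts were selected with probability proportional to the number of households, so the systematic variant with a random start used here is an assumption, as are the function names. The second function mirrors the 2003 rule of keeping only even-numbered census block units.

```python
import random


def pps_systematic(sizes, n, rng):
    """Select n units with probability proportional to size.

    sizes: dict unit -> number of households (insertion order is used).
    A unit larger than the sampling interval can be selected twice.
    """
    total = sum(sizes.values())
    step = total / n
    start = rng.random() * step  # random start in [0, step)
    targets = [start + i * step for i in range(n)]

    selected, cumulative, i = [], 0, 0
    for unit, size in sizes.items():
        cumulative += size
        # Take the unit once for every target that falls in its range.
        while i < n and targets[i] < cumulative:
            selected.append(unit)
            i += 1
    return selected


def keep_even_numbered(blocks):
    """Panel subsample: keep every second block (numbers 2, 4, ...)."""
    return [b for number, b in enumerate(blocks, start=1) if number % 2 == 0]
```

Passing a seeded generator such as `random.Random(2002)` makes a draw reproducible. Keeping every second block within each stratum roughly halves the sample while preserving the allocation by strata.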
Face-to-face [f2f]
In all surveys the same questionnaire was used, with minimal changes. It included different modules, topically separate areas intended to capture the living standard of households from different angles. Topic areas were the following:
1. Roster with demography.
2. Housing conditions and durables module with information on the age of durables owned by a household, with a special block focused on collecting information on energy billing, payments, and usage.
3. Diary of food expenditures (weekly), including home production, gifts and transfers in kind.
4. Questionnaire of main expenditure-based recall periods sufficient to enable construction of annual consumption at the household level, including home production, gifts and transfers in kind.
5. Agricultural production for all households which cultivate 10+ acres of land or who breed cattle.
6. Participation and social transfers module with detailed breakdown by programs.
7. Labour Market module in line with a simplified version of the Labour Force Survey (LFS), with special additional questions to capture various informal sector activities, and providing information on earnings.
8. Health, with a focus on utilization of services and expenditures (including informal payments).
9. Education module, which incorporated pre-school, compulsory primary education, secondary education and university education.
10. Special income block, focusing on sources of income not covered in other parts (with a focus on remittances).
During field work, interviewers kept a precise diary of interviews, recording both successful and unsuccessful visits. Particular attention was paid to reasons why some households were not interviewed. Separate marks were given for households which were not interviewed due to refusal and for cases when a given household could not be found on the territory of the chosen census block.
In 2002 a total of 7,491 households were contacted. Of this number a total of 6,386 households in 621 census rounds were interviewed. Interviewers did not manage to collect the data for 1,106 or 14.8% of selected households. Out of this number 634 households or
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives:
1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
6. Provide the necessary income data to serve in calculating poverty indices and identifying the characteristics of the poor, as well as drawing poverty maps
7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those aimed at eradicating poverty
National
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
The 2008 Household Expenditure and Income Survey sample was designed using a two-stage stratified cluster sampling method. In the first stage, the primary sampling units (PSUs), the blocks, were drawn with probability proportional to size, where the size of a block is the number of households it contains. In the second stage, the household sample (8 households from each PSU) was drawn using the systematic sampling method. Four substitute households were also drawn from each PSU, using the systematic sampling method, to be used on the first visit to the block in case any of the main sample households could not be visited for any reason.
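The two-stage selection described above can be illustrated with a minimal Python sketch. It is not the Statistical Office's procedure: the block sizes, function names, and random seed are hypothetical, and stratification is left out for brevity. The first function draws blocks with probability proportional to size along the cumulative household count, and the second draws 8 households systematically within a selected block.

```python
import random

def pps_systematic(sizes, n_select, rng):
    """Select unit indices with probability proportional to size.

    Walks a fixed step along the cumulative size scale from a random start.
    A unit larger than the step could be selected more than once.
    """
    total = sum(sizes)
    step = total / n_select
    start = rng.random() * step
    chosen, cumulative, idx = [], 0, 0
    for k in range(n_select):
        target = start + k * step
        # Advance until the target falls inside the current unit's range.
        while cumulative + sizes[idx] <= target:
            cumulative += sizes[idx]
            idx += 1
        chosen.append(idx)
    return chosen

def systematic_sample(n_units, n_select, rng):
    """Select n_select positions out of n_units with a fixed sampling interval."""
    step = n_units / n_select
    start = rng.random() * step
    return [min(int(start + k * step), n_units - 1) for k in range(n_select)]

# Hypothetical frame: number of households in each of five blocks.
block_sizes = [40, 120, 60, 80, 100]
rng = random.Random(0)
selected_blocks = pps_systematic(block_sizes, 3, rng)
# Second stage: 8 households from the first selected block.
households = systematic_sample(block_sizes[selected_blocks[0]], 8, rng)
```

Substitute households could be drawn the same way with a second call to systematic_sample, excluding positions already in the main sample.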
To estimate the sample size, the coefficient of variation and design effect in each sub-district were calculated for the expenditure variable from data of the 2006 Household Expenditure and Income Survey. These results were used to estimate the sample size at the sub-district level, under the condition that the coefficient of variation of the expenditure variable at the sub-district level did not exceed 10%, with a minimum of 6 clusters at the district level. This minimum ensures good cluster representation in the administrative areas, so that poverty pockets can be identified.
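The exact formula used by the Statistical Office is not given here. As a rough illustration only, a common textbook approximation sets the number of households to the design effect times the squared ratio of the unit-level coefficient of variation to the target coefficient of variation, ignoring the finite population correction. The function names and input values below are hypothetical; the 10% target, the 8 households per cluster, and the minimum of 6 clusters come from the description above.

```python
import math

def required_households(unit_cv, deff, target_cv=0.10):
    """Approximate households needed so the CV of the mean stays at or below target_cv.

    Assumes the CV of a mean under simple random sampling is unit_cv / sqrt(n),
    inflated by the design effect for cluster sampling.
    """
    return math.ceil(deff * (unit_cv / target_cv) ** 2)

def required_clusters(n_households, per_cluster=8, min_clusters=6):
    """Convert a household count into clusters, enforcing a minimum cluster count."""
    return max(min_clusters, math.ceil(n_households / per_cluster))

# Hypothetical inputs for one sub-district (not taken from the 2006 survey).
n_hh = required_households(unit_cv=0.8, deff=2.0)
n_clusters = required_clusters(n_hh)
```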
The sample design also took into account the expected non-response and the areas of the major cities where poor families are concentrated. A larger sample was therefore taken from these areas than from others, in order to reach and cover the poverty pockets.
Face-to-face [f2f]
List of survey questionnaires: (1) General Form (2) Expenditure on food commodities Form (3) Expenditure on non-food commodities Form
Raw Data
The design and implementation of this survey involved the following procedures:
1. Sample design and selection
2. Design of forms/questionnaires, guidelines to assist in filling out the questionnaires, and preparation of instruction manuals
3. Design of the table templates to be used for the dissemination of the survey results
4. Preparation of the fieldwork phase, including printing forms/questionnaires, instruction manuals, data collection instructions, data checking instructions and codebooks
5. Selection and training of survey staff to collect data and run the required data checks
6. Preparation and implementation of the pretest phase, designed to test and develop the forms/questionnaires, instructions and software programs required for data processing and production of survey results
7. Data collection
8. Data checking and coding
9. Data entry
10. Data cleaning using data validation programs
11. Data accuracy and consistency checks
12. Data tabulation and preliminary results
13. Preparation of the final report and dissemination of final results
Harmonized Data
- The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets
- The harmonization process started with cleaning all raw data files received from the Statistical Office
- Cleaned data files were then all merged to produce one data file on the individual level containing all variables subject to harmonization
- A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables
- A post-harmonization cleaning process was run on the data
- Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format
This dataset is imported from the US Department of Commerce, National Telecommunications and Information Administration (NTIA) and its "Data Explorer" site. The underlying data comes from the US Census
dataset: Specifies the month and year of the survey as a string, in "Mon YYYY" format. The CPS is a monthly survey, and NTIA periodically sponsors Supplements to that survey.
variable: Contains the standardized name of the variable being measured. NTIA identified the availability of similar data across Supplements, and assigned variable names to ease time-series comparisons.
description: Provides a concise description of the variable.
universe: Specifies the variable representing the universe of persons or households included in the variable's statistics. The specified variable is always included in the file. The only variables lacking universes are isPerson and isHouseholder, as they are themselves the broadest universes measured in the CPS.
A large number of *Prop, *PropSE, *Count, and *CountSE columns comprise the remainder of the columns. For each demographic being measured (see below), four statistics are produced, including the estimated proportion of the group for which the variable is true (*Prop), the standard error of that proportion (*PropSE), the estimated number of persons or households in that group for which the variable is true (*Count), and the standard error of that count (*CountSE).
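As an illustration of how an estimate and its standard error can be used together, the sketch below builds an approximate 95% confidence interval around a proportion. The normal approximation (z = 1.96) is an assumption made here, not something specified by NTIA, and the numbers in the example row are placeholders rather than published values.

```python
def confidence_interval(estimate, std_error, z=1.96):
    """Return (low, high) bounds of a normal-approximation confidence interval."""
    margin = z * std_error
    return estimate - margin, estimate + margin

# Placeholder row using the usProp / usPropSE column names; values are made up.
row = {"usProp": 0.75, "usPropSE": 0.002}
low, high = confidence_interval(row["usProp"], row["usPropSE"])
```

The same function applies to a *Count column paired with its *CountSE column.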
DEMOGRAPHIC CATEGORIES
us: The usProp, usPropSE, usCount, and usCountSE columns contain statistics about all persons and households in the universe (which represents the population of the fifty states and the District of Columbia). For example, to see how the prevalence of Internet use by Americans has changed over time, look at the usProp column for each survey's internetUser variable.
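The time-series lookup described above could be sketched as follows, assuming the file has been read into a list of dictionaries keyed by column name. The rows and their values are placeholders, not actual NTIA estimates. Parsing the "Mon YYYY" dataset label with the "%b %Y" format assumes an English locale.

```python
from datetime import datetime

def time_series(rows, variable, column="usProp"):
    """Return (dataset label, value) pairs for one variable in chronological order."""
    selected = [r for r in rows if r["variable"] == variable]
    # The dataset column holds strings such as "Nov 1994".
    selected.sort(key=lambda r: datetime.strptime(r["dataset"], "%b %Y"))
    return [(r["dataset"], float(r[column])) for r in selected]

# Placeholder rows; the usProp values are made up for illustration.
ROWS = [
    {"dataset": "Jul 2011", "variable": "internetUser", "usProp": "0.3"},
    {"dataset": "Nov 1994", "variable": "internetUser", "usProp": "0.1"},
    {"dataset": "Oct 2003", "variable": "internetUser", "usProp": "0.2"},
    {"dataset": "Oct 2003", "variable": "isPerson", "usProp": "1.0"},
]
series = time_series(ROWS, "internetUser")
```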
age: The age category is divided into five ranges: ages 3-14, 15-24, 25-44, 45-64, and 65+. The CPS only includes data on Americans ages 3 and older. Also note that household reference persons must be at least 15 years old, so the age314* columns are blank for household-based variables. Those columns are also blank for person-based variables where the universe is "isAdult" (or a sub-universe of "isAdult"), as the CPS defines adults as persons ages 15 or older. Finally, note that some variables where children are technically in the universe will show zero values for the age314* columns. This occurs in cases where a variable simply cannot be true of a child (e.g. the workInternetUser variable, as the CPS presumes children under 15 are not eligible to work), but the topic of interest is relevant to children (e.g. locations of Internet use).
work: Employment status is divided into "Employed," "Unemployed," and "NILF" (Not in the Labor Force). These three categories reflect the official BLS definitions used in official labor force statistics. Note that employment status is only recorded in the CPS for individuals ages 15 and older. As a result, children are excluded from the universe when calculating statistics by work status, even if they are otherwise considered part of the universe for the variable of interest.
income: The income category represents annual family income, rather than just an individual person's income. It is divided into five ranges: below $25K, $25K-49,999, $50K-74,999, $75K-99,999, and $100K or more. Statistics by income group are only available in this file for Supplements beginning in 2010; prior to 2010, family income range is available in public use datasets, but is not directly comparable to newer datasets due to the 2010 introduction of the practice of allocating "don't know," "refused," and other responses that result in missing data. Prior to 2010, family income is unknown for approximately 20 percent of persons, while in 2010 the Census Bureau began imputing likely income ranges to replace missing data.
education: Educational attainment is divided into "No Diploma," "High School Grad," "Some College," and "College Grad." High school graduates are considered to include GED completers, and those with some college include community college attendees (and graduates) and those who have attended certain postsecondary vocational or technical schools--in other words, it signifies additional education beyond high school, but short of attaining a bachelor's degree or equivalent. Note that educational attainment is only recorded in the CPS for individuals ages 15 and older. As a result, children are excluded from the universe when calculating statistics by education, even if they are otherwise considered part of the universe for the variable of interest.
sex: "Male" and "Female" are the two groups in this category. The CPS does not currently provide response options for intersex individuals.
race: This category includes "White," "Black," "Hispanic," "Asian," "Am Indian," and "Other" groups. The CPS asks about Hispanic origin separately from racial identification; as a result, all persons identifying as Hispanic are in the Hispanic group, regardless of how else they identify. Furthermore, all non-Hispanic persons identifying with two or more races are tallied in the "Other" group (along with other less-prevalent responses). The Am Indian group includes both American Indians and Alaska Natives.
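The grouping rule described above can be written as a short sketch. The input encoding (a Hispanic-origin flag plus a list of race labels) and the label strings are hypothetical and do not reflect how the CPS public use files store these fields; only the order of the rules follows the text.

```python
# Hypothetical single-race labels mapped to the group names used in this file.
SINGLE_RACE_GROUPS = {
    "White": "White",
    "Black": "Black",
    "Asian": "Asian",
    "American Indian or Alaska Native": "Am Indian",
}

def race_group(is_hispanic, races):
    """Assign a person to one group following the rules described in the text."""
    if is_hispanic:
        # Hispanic origin takes precedence over any racial identification.
        return "Hispanic"
    if len(races) != 1:
        # Non-Hispanic persons reporting two or more races are tallied as Other.
        return "Other"
    # Any remaining less-prevalent response also falls into Other.
    return SINGLE_RACE_GROUPS.get(races[0], "Other")
```

For example, race_group(True, ["White"]) falls in the Hispanic group, while race_group(False, ["White", "Asian"]) falls in the Other group.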
disability: Disability status is divided into "No" and "Yes" groups, indicating whether the person was identified as having a disability. Disabilities screened for in the CPS include hearing impairment, vision impairment (not sufficiently correctable by glasses), cognitive difficulties arising from physical, mental, or emotional conditions, serious difficulty walking or climbing stairs, difficulty dressing or bathing, and difficulties performing errands due to physical, mental, or emotional conditions. The Census Bureau began collecting data on disability status in June 2008; accordingly, this category is unavailable in Supplements prior to that date. Note that disability status is only recorded in the CPS for individuals ages 15 and older. As a result, children are excluded from the universe when calculating statistics by disability status, even if they are otherwise considered part of the universe for the variable of interest.
metro: Metropolitan status is divided into "No," "Yes," and "Unknown," reflecting information in the dataset about the household's location. A household located within a metropolitan statistical area is assigned to the Yes group, and those outside such areas are assigned to No. However, due to the risk of de-anonymization, the metropolitan area status of certain households is unidentified in public use datasets. In those cases, the Census Bureau has determined that revealing this geographic information poses a disclosure risk. Such households are tallied in the Unknown group.
scChldHome:
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other one with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
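With 8,000 households and 25 households per enumeration area, 320 enumeration areas are needed in total. The sketch below shows one way to split them across strata in proportion to stratum size, using largest-remainder rounding so the allocation sums to exactly 320. It is written in Python for illustration and is not the R script mentioned above; the stratum labels, their sizes, and the rounding rule are assumptions.

```python
def allocate_eas(stratum_sizes, sample_households=8000, per_ea=25):
    """Allocate enumeration areas to strata proportionally to stratum size."""
    n_eas = sample_households // per_ea  # 8000 / 25 = 320
    total = sum(stratum_sizes.values())
    quotas = {s: n_eas * size / total for s, size in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    # Hand out the EAs lost to rounding down, largest fractional part first.
    leftover = n_eas - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc

# Hypothetical strata (geo_1 by urban/rural) with made-up household counts.
strata = {
    ("province_1", "urban"): 500_000,
    ("province_1", "rural"): 300_000,
    ("province_2", "urban"): 150_000,
    ("province_2", "rural"): 50_000,
}
allocation = allocate_eas(strata)
```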
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.
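The reject-and-replace idea behind the validators can be sketched as below. The random generator is only a stand-in for the trained model, and the two consistency rules, the field names, and the retry limit are hypothetical examples, not the validators actually used to produce this dataset.

```python
import random

def head_is_adult(person):
    """Hypothetical rule: a household head must be at least 15 years old."""
    return person["relationship"] != "head" or person["age"] >= 15

def education_fits_age(person):
    """Hypothetical rule: nobody under 5 has any years of schooling."""
    return person["age"] >= 5 or person["years_schooling"] == 0

VALIDATORS = [head_is_adult, education_fits_age]

def draw_candidate(rng):
    """Stand-in generator producing one random, possibly inconsistent, person."""
    return {
        "relationship": rng.choice(["head", "spouse", "child"]),
        "age": rng.randint(0, 95),
        "years_schooling": rng.randint(0, 16),
    }

def generate_valid(rng, validators, max_tries=1000):
    """Draw candidates until one passes every validator, rejecting the rest."""
    for _ in range(max_tries):
        candidate = draw_candidate(rng)
        if all(check(candidate) for check in validators):
            return candidate
    raise RuntimeError("no valid observation found within max_tries")

rng = random.Random(1)
people = [generate_valid(rng, VALIDATORS) for _ in range(200)]
```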
This is a synthetic dataset; the "response rate" is 100%.