77 datasets found
  1. Data from: Data Fission: Splitting a Single Data Point

    • tandf.figshare.com
    txt
    Updated Dec 14, 2023
    Cite
    James Leiner; Boyan Duan; Larry Wasserman; Aaditya Ramdas (2023). Data Fission: Splitting a Single Data Point [Dataset]. http://doi.org/10.6084/m9.figshare.24328745.v2
    Available download formats: txt
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    James Leiner; Boyan Duan; Larry Wasserman; Aaditya Ramdas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Suppose we observe a random vector X from some distribution in a known family with unknown parameters. We ask the following question: when is it possible to split X into two pieces f(X) and g(X) such that neither part is sufficient to reconstruct X by itself, but both together can recover X fully, and their joint distribution is tractable? One common solution to this problem when multiple samples of X are observed is data splitting, but Rasines and Young offer an alternative approach that uses additive Gaussian noise; this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this article, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on several prototypical applications, such as post-selection inference for trend filtering and other regression problems, and effect size estimation after interactive multiple testing. Supplementary materials for this article are available online.
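
The splitting idea in the abstract can be illustrated for the simplest case, a single Gaussian observation with known standard deviation. The sketch below is not taken from this listing: it assumes the standard additive-noise construction f(X) = X + tau*Z and g(X) = X - Z/tau with Z drawn independently from N(0, sigma^2). The function names and the tuning parameter `tau` are illustrative.

```python
import random


def fission_gaussian(x, sigma, tau=1.0, rng=None):
    """Split one observation x ~ N(mu, sigma^2) into two pieces (f, g).

    Assumes sigma is known. With Z ~ N(0, sigma^2) drawn independently,
    Cov(f, g) = sigma^2 - sigma^2 = 0, so the jointly Gaussian pieces
    are independent.
    """
    rng = rng or random.Random()
    z = rng.gauss(0.0, sigma)  # external noise, same scale as the data
    f = x + tau * z            # piece that could be used for selection
    g = x - z / tau            # piece that could be used for inference
    return f, g


def recombine(f, g, tau=1.0):
    """Recover the original observation from both pieces."""
    return (f + tau ** 2 * g) / (1.0 + tau ** 2)


if __name__ == "__main__":
    rng = random.Random(0)
    f, g = fission_gaussian(2.0, sigma=1.0, tau=0.5, rng=rng)
    print(f, g, recombine(f, g, tau=0.5))
```

Neither piece alone determines x, while `recombine` returns it exactly (up to floating-point error). A larger `tau` puts more noise into `f` and leaves more information in `g`.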

  2. 2023 Census totals by topic for families and extended families by...

    • datafinder.stats.govt.nz
    csv, dwg, geodatabase +6
    Updated Nov 24, 2024
    + more versions
    Cite
    Stats NZ (2024). 2023 Census totals by topic for families and extended families by statistical area 2 [Dataset]. https://datafinder.stats.govt.nz/layer/120891-2023-census-totals-by-topic-for-families-and-extended-families-by-statistical-area-2/
    Available download formats: mapinfo tab, geopackage / sqlite, shapefile, kml, csv, geodatabase, pdf, mapinfo mif, dwg
    Dataset updated
    Nov 24, 2024
    Dataset provided by
    Statistics New Zealand (http://www.stats.govt.nz/)
    Authors
    Stats NZ
    License

    https://datafinder.stats.govt.nz/license/attribution-4-0-international/

    Description

    Dataset contains counts and measures for families and extended families from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.

    The variables included in this dataset are for families and extended families in households in occupied private dwellings:

    • Count of families
    • Family type
    • Number of people in family
    • Average number of people in family
    • Total family income
    • Median ($) total family income
    • Count of extended families
    • Extended family type
    • Total extended family income
    • Median ($) total extended family income.

    Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.

    Footnotes

    Geographical boundaries

    Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.

    Caution using time series

    Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).

    About the 2023 Census dataset

    For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.

    Data quality

    The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.

    Concept descriptions and quality ratings

    Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.

    Using data for good

    Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga".

    Confidentiality

    The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
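
To make the rounding rule concrete, here is a minimal sketch of random rounding to base 3. It rests on two assumptions that the text above does not state: that a count which is not a multiple of 3 goes to the nearer multiple with probability 2/3 and to the other neighbouring multiple with probability 1/3 (the usual description of this technique), and that "fixed" means the same cell always receives the same rounded value. The seeding scheme and the `cell_key` parameter are hypothetical, not Stats NZ's implementation.

```python
import random


def random_round_base3(count, rng):
    """Randomly round a non-negative count to a multiple of 3."""
    remainder = count % 3
    if remainder == 0:
        return count  # multiples of 3 are left unchanged
    lower = count - remainder
    upper = lower + 3
    if remainder == 1:
        nearest, other = lower, upper
    else:
        nearest, other = upper, lower
    # Assumed probabilities: 2/3 to the nearer multiple, 1/3 to the other.
    return nearest if rng.random() < 2 / 3 else other


def fixed_random_round(count, cell_key):
    """Hypothetical 'fixed' variant: seed from the cell so results repeat."""
    rng = random.Random(f"{cell_key}|{count}")
    return random_round_base3(count, rng)


if __name__ == "__main__":
    parts = [7, 8, 10]
    rounded_parts = [fixed_random_round(c, f"part-{i}") for i, c in enumerate(parts)]
    rounded_total = fixed_random_round(sum(parts), "total")
    print(rounded_parts, sum(rounded_parts), rounded_total)
```

Because each cell and the total are rounded separately, the rounded parts need not add up to the rounded total, which is why individual figures may not sum to stated totals.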

    Measures

    Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.

    Percentages

    To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.

    Symbol

    -997 Not available

    -999 Confidential
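
The percentage rule and the placeholder symbols interact: a placeholder must never be treated as a real count. A minimal sketch, assuming the two codes listed above are the only placeholders present; the function and constant names are illustrative.

```python
NOT_AVAILABLE = -997  # symbol listed above: value not available
CONFIDENTIAL = -999   # symbol listed above: value suppressed
PLACEHOLDERS = {NOT_AVAILABLE, CONFIDENTIAL}


def percentage(category_count, total_stated):
    """Return category_count as a percentage of 'Total stated'.

    Returns None when either figure is a placeholder or the
    denominator is not positive, instead of a misleading number.
    """
    if category_count in PLACEHOLDERS or total_stated in PLACEHOLDERS:
        return None
    if total_stated <= 0:
        return None
    return 100.0 * category_count / total_stated


if __name__ == "__main__":
    print(percentage(30, 120))            # 25.0
    print(percentage(CONFIDENTIAL, 120))  # None
```

Since counts are randomly rounded, percentages computed this way are approximate and may not sum to exactly 100 across categories.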

    Inconsistencies in definitions

    Please note that there may be differences in definitions between census classifications and those used for other data collections.

  3. Living Standards Survey, Wave 3 (extension), 2007-2008 - Timor-Leste

    • microdata.fao.org
    Updated Nov 8, 2022
    + more versions
    Cite
    National Statistics Directorate (2022). Living Standards Survey, Wave 3 (extension), 2007-2008 - Timor-Leste [Dataset]. https://microdata.fao.org/index.php/catalog/1507
    Dataset updated
    Nov 8, 2022
    Dataset authored and provided by
    National Statistics Directorate
    Time period covered
    2007 - 2008
    Area covered
    Timor-Leste
    Description

    Abstract

    In 2007-2008 a multi-topic household survey, the Timor Leste Living Standards Survey (LSS-2), was conducted in East Timor with the main objectives of developing a system of poverty monitoring, supporting poverty reduction, and monitoring human development indicators and progress toward the Millennium Development Goals. The LSS-3 extension survey was designed to re-visit one third of the households interviewed under the LSS-2 to explore different facets of household welfare and behaviour in the country, while also being able to make use of information collected in the LSS-2 survey for analytic purposes. The four new topics investigated in the extension survey are:

    • Risk and Vulnerability: This section is designed to help us understand the dimensions and sources of household-level vulnerability to uninsured risks in Timor Leste, and the efficacy and welfare effects of various risk-management strategies (prevention, mitigation, coping) and mechanisms (private as well as public, formal as well as informal) households do (or do not) have access to. The work in Timor Leste is part of a program of analytic work and policy dialogue throughout the EAP region, more information on which can be found on the World Bank website.
    • Land Degradation and Poverty: This section of the questionnaire is designed to identify proximate causes of deforestation through land use patterns and links with poverty; understand strengths and failures of common land resource management institutions (property rights, enforcement); understand the impact of the Siam Weed problem on household welfare.
    • Justice for Poor: The Justice for the Poor/Access to Justice (J4P/A2J) module of the survey will serve mainly as an initial diagnostic for project development in the country. The topics we would be interested in covering would be Dispute Processing/Resolution; Social Legal Norms and Perceptions of Efficiency in Government (Local, Sub-District, District and National level).
    • Access to Financial Services: The financial service work has the following two objectives: (i) to collect data on access to and use of financial services (savings and credit), both formal and informal, and (ii) to assess the quality of information on access to financial services obtained from heads of households vs. from all adults, i.e. whether a bias is introduced by not asking all household members, and whether the characteristics of the head or the household affect this (gender, age, nuclear family, urban, education levels, wealth, etc.).

    Geographic coverage

    National coverage

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    SAMPLE DESIGN FOR THE 2008 EXTENSION SURVEY

    The sample for the LSS-3 Extension survey was a sub-sample of the original LSS-2 sample. The LSS-2 field work was divided into 52 "weeks", with each week being a random subset of the total sample. The sub-sample was chosen by randomly selecting 19 weeks from the original field work schedule. Each week contained seven Primary Sampling Units (PSUs) for a total of 133 PSUs. In each PSU the teams were to interview 12 of the original 15 households, with the remaining three to serve as replacements. The total nominal sample size was thus 1596.

    Additional interviews: Following the collection and initial analysis of the data, it was determined that data from one district, Manatuto, and partially from another district, Oecussi, were of insufficient quality in certain modules. Therefore, it was decided to repeat the survey in another 25 PSUs of these two districts - six in Manatuto, and 19 in Oecussi. The additional PSUs chosen were randomly selected within the two districts from the remaining non-panel PSUs in the original LSS-2 sample.
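
The sample-size figures quoted above follow from simple multiplication. The short check below uses only numbers stated in the two preceding paragraphs; the variable names are illustrative.

```python
# Figures taken from the sample design description above.
weeks_selected = 19
psus_per_week = 7
households_listed_per_psu = 15
households_interviewed_per_psu = 12

total_psus = weeks_selected * psus_per_week                    # 133 PSUs
nominal_sample = total_psus * households_interviewed_per_psu   # 1596 households
replacements_per_psu = households_listed_per_psu - households_interviewed_per_psu

# Additional interviews: six PSUs in Manatuto and 19 in Oecussi.
additional_psus = 6 + 19

if __name__ == "__main__":
    print(total_psus, nominal_sample, replacements_per_psu, additional_psus)
```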

    Mode of data collection

    Face-to-face [f2f]

    Cleaning operations

    DATA CLEANING

    The LSS-3 had a significant number of responses in which the response is "other". In general, if the response clearly fit into a pre-coded response category, it was recoded into that category during the cleaning and compilation process. Some responses where additional information was provided were not recoded even though they clearly fit into pre-coded categories. For example, "agriculture project" would be recoded into the "agriculture" category, while "community garden" would not. Data users can either use the additional information, or re-code into categories as they see fit.

    Data appraisal

    Potential Data Quality Issues in 2008 Extension survey

    Agriculture: As with the individual roster of the previous section, the plots listed in the previous survey are listed on the pre-printed cover page and all changes noted. The agricultural section, like the other sections, suffers from problems with open-ended questions. This is particularly the case for the question asking what community restrictions are placed on the clearing of forest land (section 2d). The translation from the original question was vague (using the Tetun word for "boundary" for "restriction") and therefore many of the responses relate to physical boundaries on the land, such as stone walls and tree lines. Additionally, the translation of all answers from Tetun into English is imperfect, and those wishing to use this information for analytical purposes are advised to also refer to the original Tetun. Analysts should be careful in using the data from the open-ended questions because of translation problems. Also, it was noted during the training and field work that many interviewers had significant difficulties understanding definitions in some of the land management and investment questions. In general, however, all agricultural data may be used for analysis, with sampling weights w3.

    Finance: It should be noted that the quality of the data for the finance experiment (comparing the knowledge of the household head to that of other household members) was not sufficient for the experiment to be deemed a success. Subsequent spot-checking revealed that in many cases, interviewers asked the household head about the financial activities of various household members instead of asking them directly. Therefore, this data should only be used to measure the access to finance at the household level. The finance sections were not repeated during the additional interviews in the replacement PSUs. Sampling weights w1 should be used when doing any analysis with this data.

    Shocks and Vulnerability: It was determined following the initial round of data collection that the shocks and vulnerability module had some issues with uneven interview quality. Two reasons were listed as potential causes of the data quality issues: (1) fundamental inability to adequately translate both the word and concept of a "shock" into the Timorese context, and (2) incomplete / questionable responses to the health shock questions in particular. Analysis for health shocks should drop the "questionable" households and use the "re-interview" households, with sampling weights w2.

    Justice for the Poor: Similar to the shocks and vulnerability module, the justice module included a long series of follow up questions if the household indicated having experienced a dispute during the recall period. Again, the number of disputes experienced by the household seemed extremely low compared to expectations. This was particularly a problem with the Manatuto district in which no disputes were recorded during the first set of TLSLS2-X interviews. Analysis for the disputes section of the justice module should drop the "questionable" households and use the "re-interview" households, with sampling weights w2. The justice module also has a number of instances in which the specifications for "other" were not recorded. Every effort was made to ensure this data was as complete as possible, but gaps do remain. Also, data users should use caution when using the imputed rank variable in section 5D. The rank in terms of importance was not explicitly captured in the data entry software, and the rankings therefore had to be imputed from the order they were listed in the original data entry. Inconsistencies may exist in this variable.
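
The advice to drop "questionable" households and weight by w2 can be sketched as a weighted share. The record layout here is hypothetical: the field names `status`, `w2` and `had_dispute` and the status labels are assumptions for illustration, not the variable names in the actual data files.

```python
def weighted_share(households, outcome, weight="w2"):
    """Weighted share of households with a 0/1 outcome.

    Drops records flagged 'questionable'; original households that
    were not flagged and 're-interview' households are kept.
    """
    kept = [h for h in households if h["status"] != "questionable"]
    total_weight = sum(h[weight] for h in kept)
    if total_weight == 0:
        raise ValueError("no usable households after filtering")
    return sum(h[weight] * h[outcome] for h in kept) / total_weight


if __name__ == "__main__":
    sample = [
        {"status": "original", "w2": 2.0, "had_dispute": 1},
        {"status": "questionable", "w2": 5.0, "had_dispute": 0},
        {"status": "re-interview", "w2": 2.0, "had_dispute": 0},
    ]
    print(weighted_share(sample, "had_dispute"))  # 0.5
```

The same filter would apply to the health shock questions; the finance and agriculture sections use weights w1 and w3 respectively, as noted above.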

  4. Replication Data for: "A Topic-based Segmentation Model for Identifying...

    • search.dataone.org
    Updated Sep 25, 2024
    Cite
    Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
    Description

    We provide instructions, code and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for any researchers or practitioners to apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their datasets.

    • First, we provide R code to replicate the illustrative simulation study: see file 1.
    • Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R.
    • Third, we provide a set of code and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note that, due to the dataset terms of use by Yelp and the restriction on data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6).
    • Fourth, we provide a set of code and datasets to replicate the empirical study with professor ratings reviews data: see file 5.

    Please see more details in the description text and comments of each file.

    [A guide on how to use the code to reproduce each study in the paper]

    1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run from the beginning to the end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes.

    3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing DV and IVs matrix for customer-level segmentation study.

    3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours.

    4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing DV and IVs matrix for restaurant-level segmentation study.

    4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating restaurant-level segmentation study with Yelp. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours.

    [Guidelines for running Benchmark models in Table 6]

    • Unsupervised topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with 'ldatuning' R package), run 'LDA' function in the 'topicmodels' package. Then, compute topic probabilities per restaurant (with 'posterior' function in the package) which can be used as predictors. Then, conduct prediction with regression.
    • Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).
    • Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.
    • Aggregate regression: 'lm' default function in R.
    • Latent class regression without variable selection: 'flexmix' function in 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of dependent variable per each segment.
    • Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, we can do prediction of dependent variables per each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

    5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours.

    [A list of the versions of R, packages, and computer...

  5. Top 10 words of two topics with highest absolute values of regression...

    • plos.figshare.com
    xls
    Updated Jun 19, 2023
    Cite
    Weiran Xu; Koji Eguchi (2023). Top 10 words of two topics with highest absolute values of regression coefficients and the topic coherence measured in NPMI on the training set when K = 15. [Dataset]. http://doi.org/10.1371/journal.pone.0277104.t005
    Available download formats: xls
    Dataset updated
    Jun 19, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Weiran Xu; Koji Eguchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Top 10 words of two topics with highest absolute values of regression coefficients and the topic coherence measured in NPMI on the training set when K = 15.

  6. 2023 Census totals by topic for individuals by statistical area 2 – part 1

    • datafinder.stats.govt.nz
    csv, dwg, geodatabase +6
    Updated Nov 25, 2024
    Cite
    Stats NZ (2024). 2023 Census totals by topic for individuals by statistical area 2 – part 1 [Dataset]. https://datafinder.stats.govt.nz/layer/120897-2023-census-totals-by-topic-for-individuals-by-statistical-area-2-part-1/
    Available download formats: mapinfo tab, mapinfo mif, csv, dwg, pdf, geodatabase, shapefile, kml, geopackage / sqlite
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    Statistics New Zealand (http://www.stats.govt.nz/)
    Authors
    Stats NZ
    License

    https://datafinder.stats.govt.nz/license/attribution-4-0-international/

    Description

    Dataset contains counts and measures for individuals from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.

    The variables included in this dataset are for the census usually resident population count (unless otherwise stated). All data is for level 1 of the classification (unless otherwise stated).

    The variables for part 1 of the dataset are:

    • Census usually resident population count
    • Census night population count
    • Age (5-year groups)
    • Age (life cycle groups)
    • Median age
    • Birthplace (NZ born/overseas born)
    • Birthplace (broad geographic areas)
    • Ethnicity (total responses) for level 1 and ‘Other Ethnicity’ grouped by ‘New Zealander’ and ‘Other Ethnicity nec’
    • Māori descent indicator
    • Languages spoken (total responses)
    • Official language indicator
    • Gender
    • Cisgender and transgender status – census usually resident population count aged 15 years and over
    • Sex at birth
    • Rainbow/LGBTIQ+ indicator for the census usually resident population count aged 15 years and over
    • Sexual identity for the census usually resident population count aged 15 years and over
    • Legally registered relationship status for the census usually resident population count aged 15 years and over
    • Partnership status in current relationship for the census usually resident population count aged 15 years and over
    • Number of children born for the sex at birth female census usually resident population count aged 15 years and over
    • Average number of children born for the sex at birth female census usually resident population count aged 15 years and over
    • Religious affiliation (total responses)
    • Cigarette smoking behaviour for the census usually resident population count aged 15 years and over
    • Disability indicator for the census usually resident population count aged 5 years and over
    • Difficulty communicating for the census usually resident population count aged 5 years and over
    • Difficulty hearing for the census usually resident population count aged 5 years and over
    • Difficulty remembering or concentrating for the census usually resident population count aged 5 years and over
    • Difficulty seeing for the census usually resident population count aged 5 years and over
    • Difficulty walking for the census usually resident population count aged 5 years and over
    • Difficulty washing for the census usually resident population count aged 5 years and over.

    Download lookup file for part 1 from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.

    Footnotes

    Te Whata

    Under the Mana Ōrite Relationship Agreement, Te Kāhui Raraunga (TKR) will be publishing Māori descent and iwi affiliation data from the 2023 Census in partnership with Stats NZ. This will be available on Te Whata, a TKR platform.

    Geographical boundaries

    Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.

    Subnational census usually resident population

    The census usually resident population count of an area (subnational count) is a count of all people who usually live in that area and were present in New Zealand on census night. It excludes visitors from overseas, visitors from elsewhere in New Zealand, and residents temporarily overseas on census night. For example, a person who usually lives in Christchurch city and is visiting Wellington city on census night will be included in the census usually resident population count of Christchurch city.

    Population counts

    Stats NZ publishes a number of different population counts, each using a different definition and methodology. Population statistics – user guide has more information about different counts.

    Caution using time series

    Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).

    Study participation time series

    In the 2013 Census study participation was only collected for the census usually resident population count aged 15 years and over.

    About the 2023 Census dataset

    For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.

    Data quality

    The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.

    Concept descriptions and quality ratings

    Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.

    Disability indicator

    This data should not be used as an official measure of disability prevalence. Disability prevalence estimates are only available from the 2023 Household Disability Survey. Household Disability Survey 2023: Final content has more information about the survey.

    Activity limitations are measured using the Washington Group Short Set (WGSS). The WGSS asks about six basic activities that a person might have difficulty with: seeing, hearing, walking or climbing stairs, remembering or concentrating, washing all over or dressing, and communicating. A person was classified as disabled in the 2023 Census if there was at least one of these activities that they had a lot of difficulty with or could not do at all.
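
The classification rule in the paragraph above can be written as a small predicate. This is a sketch of the stated rule only: the domain keys and response labels are illustrative strings, not the coding used in the census files.

```python
# The six WGSS activity domains named above (keys are illustrative).
WGSS_DOMAINS = (
    "seeing",
    "hearing",
    "walking",
    "remembering",
    "washing",
    "communicating",
)

# Response levels that count toward the disability indicator.
SEVERE_RESPONSES = {"a lot of difficulty", "cannot do at all"}


def is_disabled(responses):
    """True if at least one domain has a severe response.

    `responses` maps a domain name to a response label; missing
    domains are treated as not severe.
    """
    return any(responses.get(domain) in SEVERE_RESPONSES for domain in WGSS_DOMAINS)


if __name__ == "__main__":
    print(is_disabled({"seeing": "some difficulty", "hearing": "no difficulty"}))
    print(is_disabled({"walking": "a lot of difficulty"}))
```

Several domains with only "some difficulty" do not trigger the indicator under this rule; only a single severe response does.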

    Using data for good

    Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga".

    Confidentiality

    The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.

    Measures

    Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.

    Percentages

    To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.

    Symbol

    -997 Not available

    -999 Confidential
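
    The percentage rule and the placeholder symbols above can be combined in a small helper. This is a minimal sketch: the function name, the argument names, and the choice to return None for unusable inputs are assumptions, not part of the published dataset.

```python
from typing import Optional

NOT_AVAILABLE = -997  # symbol listed above: not available
CONFIDENTIAL = -999   # symbol listed above: confidential (suppressed)
PLACEHOLDERS = {NOT_AVAILABLE, CONFIDENTIAL}


def percentage(category_count: int, total_stated: int) -> Optional[float]:
    """Percentage of 'Total stated' for one category, or None if unusable."""
    # Placeholder codes are not real counts, so no percentage can be derived
    if category_count in PLACEHOLDERS or total_stated in PLACEHOLDERS:
        return None
    # Avoid dividing by an empty or invalid denominator
    if total_stated <= 0:
        return None
    return 100.0 * category_count / total_stated
```

    Filtering out -997 and -999 before any arithmetic keeps the placeholder codes from being treated as negative counts.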

    Inconsistencies in definitions

    Please note that there may be differences in definitions between census classifications and those used for other data collections.

  7. 2023 Census totals by topic for individuals by statistical area 2 – part 2

    • datafinder.stats.govt.nz
    csv, dwg, geodatabase +6
    Updated Nov 25, 2024
    Cite
    Stats NZ (2024). 2023 Census totals by topic for individuals by statistical area 2 – part 2 [Dataset]. https://datafinder.stats.govt.nz/layer/120898-2023-census-totals-by-topic-for-individuals-by-statistical-area-2-part-2/
    Available download formats: dwg, mapinfo tab, pdf, mapinfo mif, geodatabase, shapefile, kml, geopackage / sqlite, csv
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    Statistics New Zealand (http://www.stats.govt.nz/)
    Authors
    Stats NZ
    License

    https://datafinder.stats.govt.nz/license/attribution-4-0-international/

    Description

    Dataset contains counts and measures for individuals from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.

    The variables included in this dataset are for the census usually resident population count (unless otherwise stated). All data is for level 1 of the classification.

    The variables for part 2 of the dataset are:

    • Individual home ownership for the census usually resident population count aged 15 years and over
    • Usual residence 1 year ago indicator
    • Usual residence 5 years ago indicator
    • Years at usual residence
    • Average years at usual residence
    • Years since arrival in New Zealand for the overseas-born census usually resident population count
    • Average years since arrival in New Zealand for the overseas-born census usually resident population count
    • Study participation
    • Main means of travel to education, by usual residence address for the census usually resident population who are studying
    • Main means of travel to education, by education address for the census usually resident population who are studying
    • Highest qualification for the census usually resident population count aged 15 years and over
    • Post-school qualification in New Zealand indicator for the census usually resident population count aged 15 years and over
    • Highest secondary school qualification for the census usually resident population count aged 15 years and over
    • Post-school qualification level of attainment for the census usually resident population count aged 15 years and over
    • Sources of personal income (total responses) for the census usually resident population count aged 15 years and over
    • Total personal income for the census usually resident population count aged 15 years and over
    • Median ($) total personal income for the census usually resident population count aged 15 years and over
    • Work and labour force status for the census usually resident population count aged 15 years and over
    • Job search methods (total responses) for the unemployed census usually resident population count aged 15 years and over
    • Status in employment for the employed census usually resident population count aged 15 years and over
    • Unpaid activities (total responses) for the census usually resident population count aged 15 years and over
    • Hours worked in employment per week for the employed census usually resident population count aged 15 years and over
    • Average hours worked in employment per week for the employed census usually resident population count aged 15 years and over
    • Industry, by usual residence address for the employed census usually resident population count aged 15 years and over
    • Industry, by workplace address for the employed census usually resident population count aged 15 years and over
    • Occupation, by usual residence address for the employed census usually resident population count aged 15 years and over
    • Occupation, by workplace address for the employed census usually resident population count aged 15 years and over
    • Main means of travel to work, by usual residence address for the employed census usually resident population count aged 15 years and over
    • Main means of travel to work, by workplace address for the employed census usually resident population count aged 15 years and over
    • Sector of ownership for the employed census usually resident population count aged 15 years and over
    • Individual unit data source.

    Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.

    Footnotes

    Te Whata

    Under the Mana Ōrite Relationship Agreement, Te Kāhui Raraunga (TKR) will be publishing Māori descent and iwi affiliation data from the 2023 Census in partnership with Stats NZ. This will be available on Te Whata, a TKR platform.

    Geographical boundaries

    Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.

    Subnational census usually resident population

    The census usually resident population count of an area (subnational count) is a count of all people who usually live in that area and were present in New Zealand on census night. It excludes visitors from overseas, visitors from elsewhere in New Zealand, and residents temporarily overseas on census night. For example, a person who usually lives in Christchurch city and is visiting Wellington city on census night will be included in the census usually resident population count of Christchurch city.

    Population counts

    Stats NZ publishes a number of different population counts, each using a different definition and methodology. Population statistics – user guide has more information about different counts.

    Caution using time series

    Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).

    Study participation time series

    In the 2013 Census study participation was only collected for the census usually resident population count aged 15 years and over.

    About the 2023 Census dataset

    For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.

    Data quality

    The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.

    Concept descriptions and quality ratings

    Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.

    Disability indicator

    This data should not be used as an official measure of disability prevalence. Disability prevalence estimates are only available from the 2023 Household Disability Survey. Household Disability Survey 2023: Final content has more information about the survey.

    Activity limitations are measured using the Washington Group Short Set (WGSS). The WGSS asks about six basic activities that a person might have difficulty with: seeing, hearing, walking or climbing stairs, remembering or concentrating, washing all over or dressing, and communicating. A person was classified as disabled in the 2023 Census if there was at least one of these activities that they had a lot of difficulty with or could not do at all.

    Using data for good

    Stats NZ expects that census data is used with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga".

    Confidentiality

    The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.

    Measures

    Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures

  8. Top 10 words of two topics with highest absolute values of regression...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Weiran Xu; Koji Eguchi (2023). Top 10 words of two topics with highest absolute values of regression coefficients. [Dataset]. http://doi.org/10.1371/journal.pone.0277104.t003
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Weiran Xu; Koji Eguchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Top 10 words of two topics with highest absolute values of regression coefficients.

  9. Baseline multiple linear regression model with end fitness as the response...

    • plos.figshare.com
    xls
    Updated Jun 5, 2023
    Cite
    Dan Weaving; Ben Jones; Matt Ireton; Sarah Whitehead; Kevin Till; Clive B. Beggs (2023). Baseline multiple linear regression model with end fitness as the response variable, showing the calculated variable inflation factors (VIFs). [Dataset]. http://doi.org/10.1371/journal.pone.0211776.t004
    Available download formats: xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Dan Weaving; Ben Jones; Matt Ireton; Sarah Whitehead; Kevin Till; Clive B. Beggs
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Baseline multiple linear regression model with end fitness as the response variable, showing the calculated variable inflation factors (VIFs).

  10. Inability to keep home adequately warm by NUTS 2 region

    • ec.europa.eu
    Updated Mar 22, 2025
    Cite
    Eurostat (2025). Inability to keep home adequately warm by NUTS 2 region [Dataset]. http://doi.org/10.2908/ILC_MDES01_R
    Available download formats: application/vnd.sdmx.genericdata+xml;version=2.1, application/vnd.sdmx.data+csv;version=2.0.0, application/vnd.sdmx.data+csv;version=1.0.0, tsv, json, application/vnd.sdmx.data+xml;version=3.0.0
    Dataset updated
    Mar 22, 2025
    Dataset authored and provided by
    Eurostat (https://ec.europa.eu/eurostat)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2021 - 2024
    Area covered
    Molise, Helsinki-Uusimaa, Stockholm, Leipzig, Principado de Asturias, Zealand, Makroregion centralny, Zachodniopomorskie, Thüringen, Budapest
    Description

    The European Union Statistics on Income and Living Conditions (EU-SILC) collects timely and comparable multidimensional microdata on income, poverty, social exclusion and living conditions.

    The EU-SILC collection is a key instrument for providing information required by the European Semester ([1]) and the European Pillar of Social Rights, and the main source of data for microsimulation purposes and flash estimates of income distribution and poverty rates.

    AROPE remains crucial to monitor European social policies, especially to monitor the EU 2030 target on poverty and social exclusion. For more information, please consult EU social indicators.

    The EU-SILC instrument provides two types of data:

    • Cross-sectional data pertaining to a given time or a certain time period with variables on income, poverty, social exclusion and other living conditions.
    • Longitudinal data pertaining to individual-level changes over time, observed periodically over a rotation scheme of four or more years (Annex III (2) of 2019/1700).

    EU-SILC collects:

    • annual variables,
    • three-yearly modules,
    • six-yearly modules,
    • ad-hoc new policy needs modules,
    • optional variables.

    The variables collected are grouped by topic and detailed topic and transmitted to Eurostat in four main files (D-File, H-File, R-File and P-file).

    The domain ‘Income and Living Conditions’ covers the following topics: persons at risk of poverty or social exclusion, income inequality, income distribution and monetary poverty, living conditions, material deprivation, and EU-SILC ad-hoc modules, which are structured into collections of indicators on specific topics.

    In 2023, in addition to annual data, EU-SILC collected the three-yearly module on labour market and housing, the six-yearly module on intergenerational transmission of advantages and disadvantages, housing difficulties, and the ad hoc subject on household energy efficiency.

    From 2021 onwards, the EU quality reports use the structure of the Single Integrated Metadata Structure (SIMS).

    ([1]) The European Semester is the European Union’s framework for the coordination and surveillance of economic and social policies.

  11. Descriptive statistics of the variables.

    • plos.figshare.com
    xls
    Updated Jan 31, 2025
    Cite
    Baodong Chen; Jing Li; Jiayi Zhang (2025). Descriptive statistics of the variables. [Dataset]. http://doi.org/10.1371/journal.pone.0317892.t002
    Available download formats: xls
    Dataset updated
    Jan 31, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Baodong Chen; Jing Li; Jiayi Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Corporate financialization is a growing concern in China, and its impact on the main business of real enterprises is a crucial topic. This paper uses data from all A-share non-financial listed companies in China between 2013 and 2022 to establish a dynamic panel threshold model and test the effect of corporate financialization on enterprise performance. The empirical results indicate a threshold effect between the two variables: corporate financialization has both positive and negative effects on main business performance, with a threshold of 5.82%. Additionally, significant heterogeneous results are found for the nature of ownership, asset maturity, industry and regional distribution.

  12. MSE and sample standard deviation on test set of movie rating score...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Weiran Xu; Koji Eguchi (2023). MSE and sample standard deviation on test set of movie rating score prediction when K = 15. [Dataset]. http://doi.org/10.1371/journal.pone.0277104.t006
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Weiran Xu; Koji Eguchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MSE and sample standard deviation on test set of movie rating score prediction when K = 15.

  13. Michigan Public Policy Survey, 2009 - 2016

    • icpsr.umich.edu
    Updated Feb 28, 2024
    Cite
    University of Michigan. Center for Local, State, and Urban Policy (2024). Michigan Public Policy Survey, 2009 - 2016 [Dataset]. http://doi.org/10.3886/ICPSR39057.v1
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    University of Michigan. Center for Local, State, and Urban Policy
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/39057/terms

    Time period covered
    Apr 23, 2009 - Dec 13, 2016
    Area covered
    Michigan
    Description

    The Michigan Public Policy Survey (MPPS) is a program of state-wide surveys of local government leaders in Michigan. The MPPS is designed to fill an important information gap in the policymaking process. While there are ongoing surveys of the business community and of the citizens of Michigan, before the MPPS there were no ongoing surveys of local government officials that were representative of all general purpose local governments in the state. Therefore, while we knew the policy priorities and views of the state's businesses and citizens, we knew very little about the views of the local officials who are so important to the economies and community life throughout Michigan. The MPPS was launched in 2009 by the Center for Local, State, and Urban Policy (CLOSUP) at the University of Michigan and is conducted in partnership with the Michigan Association of Counties, Michigan Municipal League, and Michigan Townships Association. The associations provide CLOSUP with contact information for the survey's respondents, and consult on survey topics. CLOSUP makes all decisions on survey design, data analysis, and reporting, and receives no funding support from the associations. The surveys investigate local officials' opinions and perspectives on a variety of important public policy issues and solicit factual information about their localities relevant to policymaking. Over time, the program has covered issues such as fiscal, budgetary and operational policy, fiscal health, public sector compensation, workforce development, local-state governmental relations, intergovernmental collaboration, economic development strategies and initiatives such as placemaking and economic gardening, the role of local government in environmental sustainability, energy topics such as hydraulic fracturing ("fracking") and wind power, trust in government, views on state policymaker performance, opinions on the impacts of the Federal Stimulus Program (ARRA), and more. 
    The program will investigate many other issues relevant to local and state policy in the future. A searchable database of every question the MPPS has asked is available on CLOSUP's website. Results of MPPS surveys are currently available as reports, and via online data tables. The MPPS datasets are being released in two forms: public-use datasets and restricted-use datasets. Unlike the public-use datasets, the restricted-use datasets represent full MPPS survey waves, and include all of the survey questions from a wave. Restricted-use datasets also allow for multiple waves to be linked together for longitudinal analysis. The MPPS staff do still modify these restricted-use datasets to remove jurisdiction and respondent identifiers and to recode other variables in order to protect confidentiality. However, it is theoretically possible that a researcher might be able, in some rare cases, to use enough variables from a full dataset to identify a unique jurisdiction, so access to these datasets is restricted and approved on a case-by-case basis. CLOSUP encourages researchers interested in the MPPS to review the codebooks included in this data collection to see the full list of variables, including those not found in the public-use datasets, and to explore the MPPS data using the public-use datasets. The codebooks for these restricted-use datasets are available for download on CLOSUP's website.

  14. Inability to face unexpected financial expenses by NUTS 2 region

    • ec.europa.eu
    Updated Oct 10, 2025
    Cite
    Eurostat (2025). Inability to face unexpected financial expenses by NUTS 2 region [Dataset]. http://doi.org/10.2908/ILC_MDES04_R
    Available download formats: json, application/vnd.sdmx.data+csv;version=2.0.0, application/vnd.sdmx.data+xml;version=3.0.0, tsv, application/vnd.sdmx.data+csv;version=1.0.0, application/vnd.sdmx.genericdata+xml;version=2.1
    Dataset updated
    Oct 10, 2025
    Dataset authored and provided by
    Eurostat (https://ec.europa.eu/eurostat)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2021 - 2024
    Area covered
    Voreia Elláda, Koblenz, Zuid-Nederland, Åland, Länsi-Suomi, Emilia-Romagna, Spain, Makroregion północno-zachodni, Pays de la Loire, Noroeste, Severozápad
    Description

    The European Union Statistics on Income and Living Conditions (EU-SILC) collects timely and comparable multidimensional microdata on income, poverty, social exclusion and living conditions.

    The EU-SILC collection is a key instrument for providing information required by the European Semester ([1]) and the European Pillar of Social Rights, and the main source of data for microsimulation purposes and flash estimates of income distribution and poverty rates.

    AROPE remains crucial to monitor European social policies, especially to monitor the EU 2030 target on poverty and social exclusion. For more information, please consult EU social indicators.

    The EU-SILC instrument provides two types of data:

    • Cross-sectional data pertaining to a given time or a certain time period with variables on income, poverty, social exclusion and other living conditions.
    • Longitudinal data pertaining to individual-level changes over time, observed periodically over a rotation scheme of four or more years (Annex III (2) of 2019/1700).

    EU-SILC collects:

    • annual variables,
    • three-yearly modules,
    • six-yearly modules,
    • ad-hoc new policy needs modules,
    • optional variables.

    The variables collected are grouped by topic and detailed topic and transmitted to Eurostat in four main files (D-File, H-File, R-File and P-file).

    The domain ‘Income and Living Conditions’ covers the following topics: persons at risk of poverty or social exclusion, income inequality, income distribution and monetary poverty, living conditions, material deprivation, and EU-SILC ad-hoc modules, which are structured into collections of indicators on specific topics.

    In 2023, in addition to annual data, EU-SILC collected the three-yearly module on labour market and housing, the six-yearly module on intergenerational transmission of advantages and disadvantages, housing difficulties, and the ad hoc subject on household energy efficiency.

    From 2021 onwards, the EU quality reports use the structure of the Single Integrated Metadata Structure (SIMS).

    ([1]) The European Semester is the European Union’s framework for the coordination and surveillance of economic and social policies.

  15. South African Social Attitudes Survey (SASAS) 2006: Combined data with...

    • demo-b2find.dkrz.de
    Updated Sep 20, 2025
    Cite
    (2025). South African Social Attitudes Survey (SASAS) 2006: Combined data with household weight - All provinces - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/d65e8780-c31c-5683-9a82-3d00c5e17fb1
    Dataset updated
    Sep 20, 2025
    Area covered
    South Africa
    Description

    Description: The harmonised core module data are available in the combined dataset. The questions contained in the core modules of the two SASAS questionnaires for 2006 (demographics and core thematic issues) were asked of 7000 respondents, while the remaining rotating modules were asked of a half sample of approximately 3500 respondents each. The combined data set contains 5843 records and 157 variables. Topics included in the questionnaires are: democracy, identity, public services, moral issues, crime, voting, demographics and other classificatory variables. This version of the combined dataset should be used where analysis is to be performed at household level. Abstract: The primary objective of the South African Social Attitudes Survey (SASAS) is to design, develop and implement a conceptually and methodologically robust study of changing social attitudes and values in South Africa. In meeting this objective, the HSRC is carefully and consistently monitoring and providing insight into changes in attitudes among various socio-demographic groupings. SASAS is intended to provide a unique long-term account of the social fabric of modern South Africa, and of how its changing political and institutional structures interact over time with changing social attitudes and values. The survey has been designed to yield a national representative sample of adults aged 16 and older, using the Human Sciences Research Council's (HSRC) Master Sample, which was designed in 2002 and consists of 1000 primary sampling units (PSUs). These PSUs were drawn, with probability proportional to size from a pre-census 2001 list of 80780 enumerator areas (EAs). As the basis of the 2006 SASAS round of interviewing, a sub-sample of 500 EAs (PSUs) was drawn from the master sample. Three explicit stratification variables were used, namely province, geographic type and majority population group. The survey is conducted annually and the 2006 survey is the fourth wave in the series. 
    To accommodate the wide variety of topics included in the survey, two questionnaires are administered simultaneously. Apart from the standard set of demographic and background variables, each version of the questionnaire contained a harmonised core module. The questions contained in the core modules of the two SASAS questionnaires (demographics and core thematic issues) were asked of 7000 respondents, while the remaining rotating modules were asked of a half sample of approximately 3500 respondents each. The core module remains constant, with the aim of monitoring change and continuity in a variety of socio-economic and socio-political variables. In addition, a number of themes are accommodated in rotation. The rotating element of the survey consists of two or more topic-specific modules in each round of interviewing and is directed at measuring a range of policy and academic concerns and issues that require more detailed examination at a specific point in time than the multi-topic core module would permit. Topics included in the questionnaires are: democracy, national identity, public services, moral issues, crime, voting, demographics and other classificatory variables. Rotating modules are: media and communication, health status and behavior, social exclusion, tourism and leisure, intergroup relations, Soccer World Cup, work and welfare, social exclusion, democracy part 2, water services and poverty. International Social Survey Programme (ISSP web page: www.issp.org/). The International Social Survey Programme (ISSP) is run by a group of research organisations, each of which undertakes to field annually an agreed module of questions on a chosen topic area. SASAS 2003 represents the formalisation of South Africa's inclusion in the ISSP, the intention being to include the module in one of the SASAS questionnaires in each round of interviewing.
    Each module is chosen for repetition at intervals to allow comparisons both between countries (membership currently stands at 48) and over time. In 2006, the chosen subject was the role of government, and the module was carried in version two of the questionnaire (Qs. 174-229). This data can be accessed through the ISSP data portal (see link above).
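
    The description says primary sampling units were drawn with probability proportional to size. One common way to do this is systematic PPS selection, sketched below. The source does not say which PPS procedure was used, so the systematic approach, the function name, and the parameters are all assumptions for illustration.

```python
import random
from typing import List, Sequence


def pps_systematic(sizes: Sequence[int], n: int, seed: int = 0) -> List[int]:
    """Select n unit indices with probability proportional to size.

    Assumed procedure: lay the units end to end on a line of length
    sum(sizes), pick a random start, then step along at equal intervals.
    A unit larger than one interval can be selected more than once.
    """
    rng = random.Random(seed)
    step = sum(sizes) / n
    start = rng.random() * step  # random start in [0, step)
    points = [start + k * step for k in range(n)]

    chosen: List[int] = []
    cumulative = 0.0
    next_point = 0
    for index, size in enumerate(sizes):
        cumulative += size
        # Every selection point falling inside this unit's span selects it
        while next_point < n and points[next_point] < cumulative:
            chosen.append(index)
            next_point += 1
    return chosen
```

    For example, pps_systematic([10, 20, 30, 40], n=2) returns two indices, with larger units more likely to be picked.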

  16. Synthetic Administrative Data: Census 1991, 2023

    • datacatalogue.ukdataservice.ac.uk
    Updated Feb 21, 2024
    Cite
    Shlomo, N, University of Manchester; Kim, M, University of Manchester (2024). Synthetic Administrative Data: Census 1991, 2023 [Dataset]. http://doi.org/10.5255/UKDA-SN-856310
    Dataset updated
    Feb 21, 2024
    Authors
    Shlomo, N, University of Manchester; Kim, M, University of Manchester
    Area covered
    United Kingdom
    Description

    We create a synthetic administrative dataset to be used in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin) that mimics the properties of a real administrative dataset according to specifications by the ONS. Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1033664 individuals.
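
    The delete, move, and duplicate steps described above can be sketched as a single pass over the records. This is only an illustration under stated assumptions: the record layout, the rates table, treating the three operations as mutually exclusive per record, and applying the proportions as per-record probabilities are all hypothetical, and the authors' actual procedure may differ.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, str]
# (p_delete, p_move, p_duplicate) for each (ethnic group, gender) pair
Rates = Dict[Tuple[str, str], Tuple[float, float, float]]


def perturb(records: List[Record], rates: Rates,
            geographies: List[str], seed: int = 0) -> List[Record]:
    """Delete, move, or duplicate records at group-specific rates."""
    rng = random.Random(seed)
    result: List[Record] = []
    for rec in records:
        p_delete, p_move, p_dup = rates[(rec["group"], rec["gender"])]
        draw = rng.random()
        if draw < p_delete:
            continue  # record dropped: simulates under-coverage
        others = [g for g in geographies if g != rec["geography"]]
        if draw < p_delete + p_move:
            # record kept, but placed in a different geography
            result.append({**rec, "geography": rng.choice(others)})
        elif draw < p_delete + p_move + p_dup:
            # original kept, plus a copy in a different geography
            result.append(dict(rec))
            result.append({**rec, "geography": rng.choice(others)})
        else:
            result.append(dict(rec))  # record left untouched
    return result
```

    Keying the rates by (group, gender) mirrors the description's pre-specified proportions for each broad ethnic group and gender.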

    National Statistical Institutes (NSIs) are directing resources into advancing the use of administrative data in official statistics systems. This is a top priority for the UK Office for National Statistics (ONS) as they are undergoing transformations in their statistical systems to make more use of administrative data for future censuses and population statistics. Administrative data are defined as secondary data sources since they are produced by other agencies as a result of an event or a transaction relating to administrative procedures of organisations, public administrations and government agencies. Nevertheless, they have the potential to become important data sources for the production of official statistics by significantly reducing the cost and burden of response and improving the efficiency of such systems. Embedding administrative data in statistical systems is not without costs and it is vital to understand where potential errors may arise. The Total Administrative Data Error Framework sets out all possible sources of error when using administrative data as statistical data, depending on whether it is a single data source or integrated with other data sources such as survey data. For a single administrative data source, one of the main sources of error is coverage and representation of the target population of interest. This is particularly relevant when administrative data is delivered over time, such as tax data for maintaining the Business Register. For sub-project 1 of this research project, we develop quality indicators that allow the statistical agency to assess if the administrative data is representative of the target population and which sub-groups may be missing or over-covered. This is essential for producing unbiased estimates from administrative data. 
Another priority at statistical agencies is to produce a statistical register for population characteristic estimates, such as employment statistics, from multiple sources of administrative and survey data. Using administrative data to build a spine, survey data can be integrated using record linkage and statistical matching approaches on a set of common matching variables. This will be the topic for sub-project 2, which will be split into several topics of research. The first topic is whether adding statistical predictions and correlation structures improves the linkage and data integration. The second topic is to research a mass imputation framework for imputing missing target variables in the statistical register where the missing data may be due to multiple underlying mechanisms. Therefore, the third topic will aim to improve the mass imputation framework to mitigate against possible measurement errors, for example by adding benchmarks and other constraints into the approaches. On completion of a statistical register, estimates for key target variables at local areas can easily be aggregated. However, it is essential to also measure the precision of these estimates through mean square errors and this will be the fourth topic of the sub-project. Finally, this new way of producing official statistics is compared to the more common method of incorporating administrative data through survey weights and model-based estimation approaches. In other words, we evaluate whether it is better 'to weight' or 'to impute' for population characteristic estimates - a key question under investigation by survey statisticians in the last decade.

  17. Data from: Research on potential disruptive technology identification based...

    • search.dataone.org
    • datasetcatalog.nlm.nih.gov
    • +1more
    Updated Jul 25, 2025
    M. L. Ding (2025). Research on potential disruptive technology identification based on technology network [Dataset]. http://doi.org/10.5061/dryad.z612jm6j8
    Dataset updated
    Jul 25, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    M. L. Ding
    Time period covered
    Jan 1, 2023
    Description

    Three evident and meaningful characteristics of disruptive technology are the zeroing effect, whereby its remarkable and unprecedented progress renders sustaining technology useless; the reshaping of the technological and economic landscape; and its leading role in the future mainstream of the technology system. All of these have profound impacts and positive influences. The identification of disruptive technology is a universally difficult task. Therefore, the paper aims to enhance the technical relevance of potential disruptive technology identification results and improve the granularity and effectiveness of potential disruptive technology identification topics. Following life cycle theory, the time span is divided into stages, and dynamic technology networks are then constructed and analyzed to identify potential disruptive technology. The LDA topic model is then used to further clarify the topic content of potential disruptive technologies. This paper takes the large civil UAVs as an example to prove the feas...

    Through the analysis of the technology life cycle, the division of the patents, the construction of the technology network, the identification of leaping nodes, and the clustering of technical topics, we aim to identify potential disruptive technology.

    Procedures:

    Knowledge flow: becoming familiar with the technical background knowledge in the field of large civil UAVs, and carrying out the technical decomposition.

    Invention patents: analyzing the technology life cycle with Loglet Lab to separate the invention patents into four parts. For each part, constructing the IPC technical network and identifying the leapfrogging and diffusible nodes.

    Technical topics: using the LDA model to cluster and explain the broad and varied content of the inventions.

    Testing: Dividing the inventions of the embryonic stage into two groups and examining them by means of the Mann-Whitney test. Finally, the result shows the huge differences in the patent value, sustaining influence, and c...

    This README file was generated on 2023-11-25 by Mingli Ding.

    GENERAL INFORMATION

    1. Title of Dataset: technical network in the field of large civilian UAVs

    2. Author Information

    Investigators Contact Information
    Name: Mingli Ding; Wangke Yu; Ran Li; Zhenzhen Wang; Jianing Li
    Institution: Jingdezhen Ceramic University
    Address: Jingdezhen, Jiangxi, China
    Email:

    3. Date of data collection: 2005-2018

    DATA & FILE OVERVIEW

    1. File List:

    A)patent (2005-2008).csv

    B)patents (2009-2012).csv

    C)patents (2013-2015).csv

    D)patents (2016-2018).csv

    E)technical network (2005-2008).csv

    F)technical network (2009-2012).csv

    G)technical networks (2013-2015).csv

    H)technical network (2016-2018).csv

    DATA-SPECIFIC INFORMATION FOR: patent (2005-2008).csv

    1. Number of variables: 2

    2. Number of cases/rows: 234

    3. Variable List:

      • source: the source of the edges in the technical network
      • target: the target of the edges in the technical network

    4. Specialized fo...
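As a rough illustration of how the two-column edge lists described above could be read, the sketch below loads `source`/`target` rows and counts node degrees. It assumes the CSV header uses exactly those two column names; the sample rows are made up for the example and are not taken from the dataset, and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_edges(handle):
    """Read (source, target) pairs from a CSV with 'source' and 'target' columns."""
    reader = csv.DictReader(handle)
    return [(row["source"], row["target"]) for row in reader]


def degree_counts(edges):
    """Count how many edges touch each node, treating the network as undirected."""
    degree = defaultdict(int)
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return dict(degree)


# Made-up sample rows standing in for one of the technical network files.
sample = io.StringIO("source,target\nB64C,B64D\nB64C,G05D\n")
edges = load_edges(sample)
degrees = degree_counts(edges)
print(edges)
print(degrees)
```

For a real file, `io.StringIO(...)` would be replaced by `open(path, newline="")`, where `path` points to one of the listed CSV files.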

  18. Data from: Variable description.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 7, 2023
    Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan (2023). Variable description. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001031498
    Dataset updated
    Dec 7, 2023
    Authors
    Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan
    Description

    Studies in the past have examined asthma prevalence and the associated risk factors in the United States using data from national surveys. However, the findings of these studies may not be relevant to specific states because of the different environmental and socioeconomic factors that vary across regions. The 2019 Behavioral Risk Factor Surveillance System (BRFSS) showed that Michigan had higher asthma prevalence rates than the national average. In this regard, we employ various modern machine learning techniques to predict asthma and identify risk factors associated with asthma among Michigan adults using the 2019 BRFSS data. After data cleaning, a sample of 10,337 individuals was selected for analysis, out of which 1,118 individuals (10.8%) reported having asthma during the survey period. Typical machine learning techniques often perform poorly due to imbalanced data issues. To address this challenge, we employed two synthetic data generation techniques, namely the Random Over-Sampling Examples (ROSE) and Synthetic Minority Over-Sampling Technique (SMOTE) and compared their performances. The overall performance of machine learning algorithms was improved using both methods, with ROSE performing better than SMOTE. Among the ROSE-adjusted models, we found that logistic regression, partial least squares, gradient boosting, LASSO, and elastic net had comparable performance, with sensitivity at around 50% and area under the curve (AUC) at around 63%. Due to ease of interpretability, logistic regression is chosen for further exploration of risk factors. Presence of chronic obstructive pulmonary disease, lower income, female sex, financial barrier to see a doctor due to cost, taken flu shot/spray in the past 12 months, 18–24 age group, Black, non-Hispanic group, and presence of diabetes are identified as asthma risk factors. 
This study demonstrates the potential of machine learning coupled with imbalanced data modeling approaches for predicting asthma from a large survey dataset. We conclude that the findings could guide early screening of at-risk asthma patients and designing appropriate interventions to improve care practices.
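To make the class-imbalance step more concrete, here is a minimal, simplified sketch of SMOTE-style over-sampling: each synthetic minority case is an interpolation between a minority point and one of its nearest minority neighbours. It is not the implementation used in the study (which also used ROSE), the function name and parameters are hypothetical, and the feature vectors are toy values.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Requires at least two minority points; simplified for illustration.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Class counts reported in the description: 1,118 asthma cases out of 10,337.
n_total, n_minority = 10_337, 1_118
n_majority = n_total - n_minority
n_needed = n_majority - n_minority  # synthetic cases needed for a 50/50 balance

# Toy two-feature minority sample (made-up values).
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(toy_minority, n_new=5, k=2, seed=1)
print(n_needed, new_points)
```

Whether the study balanced the classes fully or only partially is not stated in this listing, so the 50/50 target above is just one possible choice.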

  19. Data from: National Science Foundation Surveys of Public Attitudes Toward...

    • icpsr.umich.edu
    ascii, sas, spss
    Updated Jan 18, 2006
    Miller, Jon D.; Kimmel, Linda (2006). National Science Foundation Surveys of Public Attitudes Toward and Understanding of Science and Technology, 1979-2001: [United States] [Dataset]. http://doi.org/10.3886/ICPSR04029.v1
    Available download formats: spss, ascii, sas
    Dataset updated
    Jan 18, 2006
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Miller, Jon D.; Kimmel, Linda
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/4029/terms

    Time period covered
    1979 - 2001
    Area covered
    United States
    Description

    The National Science Foundation (NSF) Surveys of Public Attitudes monitored the general public's attitudes toward and interest in science and technology. In addition, the survey assessed levels of literacy and understanding of scientific and environmental concepts and constructs, how scientific knowledge and information were acquired, attentiveness to public policy issues, and computer access and usage. Since 1979, the survey was administered at regular intervals (occurring every two or three years), producing 11 cross-sectional surveys through 2001. Data for Part 1 (Survey of Public Attitudes Multiple Wave Data) were comprised of the survey questionnaire items asked most often throughout the 22-year survey series and account for approximately 70 percent of the original questions asked. Data for Part 2, General Social Survey Subsample Data, combine the 1983-1999 Survey of Public Attitudes data with a subsample from the 2002 General Social Survey (GSS) (GENERAL SOCIAL SURVEYS, 1972-2002: [CUMULATIVE FILE] [ICPSR 3728]) and focus solely on levels of education and computer access and usage. Variables for Part 1 include the respondents' interest in new scientific or medical discoveries and inventions, space exploration, military and defense policies, whether they voted in a recent election, if they had ever contacted an elected or public official about topics regarding science, energy, defense, civil rights, foreign policy, or general economics, and how they felt about government spending on scientific research. Respondents were asked how they received information concerning science or news (e.g., via newspapers, magazines, or television), what types of television programming they watched, and what kind of magazines they read. Respondents were asked a series of questions to assess their understanding of scientific concepts like DNA, probability, and experimental methods. 
Respondents were also asked if they agreed with statements concerning science and technology and how they affect everyday living. Respondents were further asked a series of true and false questions regarding science-based statements (e.g., the center of the Earth is hot, all radioactivity is manmade, electrons are smaller than atoms, the Earth moves around the sun, humans and dinosaurs co-existed, and human beings developed from earlier species of animals). Variables for Part 2 include highest level of math attained in high school, whether the respondent had a postsecondary degree, field of highest degree, number of science-based college courses taken, major in college, household ownership of a computer, access to the World Wide Web, number of hours spent on a computer at home or at work, and topics searched for via the Internet. Demographic variables for Parts 1 and 2 include gender, race, age, marital status, number of people in household, level of education, and occupation.

  20. Social Attitudes Survey 2005 - South Africa

    • microdata.worldbank.org
    • catalog.ihsn.org
    • +1more
    Updated May 7, 2014
    Cite
    Human Sciences Research Council (HSRC) (2014). Social Attitudes Survey 2005 - South Africa [Dataset]. https://microdata.worldbank.org/index.php/catalog/1594
    Dataset updated
    May 7, 2014
    Dataset provided by
    Human Sciences Research Council (https://hsrc.ac.za/)
    Authors
    Human Sciences Research Council (HSRC)
    Time period covered
    2005
    Area covered
    South Africa
    Description

    Abstract

    The primary objective of SASAS is to design, develop and implement a conceptually and methodologically robust study of changing social attitudes and values in South Africa to be able to carefully and consistently monitor and explain changes in attitudes amongst various socio-demographic groupings. The SASAS explores a wide range of value changes, including the distribution and shape of racial attitudes and aspirations, attitudes towards democratic and constitutional issues, and the redistribution of resources and power. Moreover, there is also an explicit interest in mapping changing attitudes towards some of the moral issues that confront and are fiercely debated in South Africa, such as gender issues, AIDS, crime and punishment, governance, and service delivery. The SASAS is intended to provide a unique long-term account of the social fabric of modern South Africa, and of how its changing political and institutional structures interact over time with changing social attitudes and values.

    Geographic coverage

    National coverage

    Analysis unit

    The units of analysis in the study are households and individuals

    Universe

    The population under investigation includes adults aged 16 and older in private households in South Africa

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Sampling Design: The South African Social Attitudes Survey has been designed to yield a representative sample of adults aged 16 and older. The sampling frame for the survey is the Human Sciences Research Council’s (HSRC) Master Sample, which was designed in 2002 and consists of 1 000 primary sampling units (PSUs). The 2001 population census enumerator areas (EAs) were used as PSUs. These PSUs were drawn, with probability proportional to size, from a pre-census 2001 list of EAs provided by Statistics South Africa.

    The Master Sample excludes special institutions (such as hospitals, military camps, old age homes, school and university hostels), recreational areas, industrial areas and vacant EAs. It therefore focuses on dwelling units or visiting points as secondary sampling units, which have been defined as ‘separate (non-vacant) residential stands, addresses, structures, flats, homesteads, etc.’.

    As the basis of the 2005 SASAS round of interviewing, a sub-sample of 500 PSUs was drawn from the HSRC’s Master Sample. Three explicit stratification variables were used, namely province, geographic type and majority population group.

    Within each stratum, the allocated number of PSUs was drawn using probability proportional to size sampling. In each of these drawn PSUs, two clusters of 7 dwelling units each were drawn. These 14 dwelling units in each drawn PSU were systematically grouped into two subsamples of seven, to give the two SASAS samples.

    Number of units: Questionnaire 1: 2 497 cases realised from 3 500 addresses; Questionnaire 2: 2 483 cases realised from 3 500 addresses; combined: 4 980 cases
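The sample sizes quoted above follow directly from the design figures, as the short sketch below checks. The alternating rule used to split the 14 dwelling units per PSU into two subsamples is only an assumption for illustration; the description says the grouping was systematic but does not give the exact rule.

```python
# Design figures taken from the sampling description.
PSUS_DRAWN = 500
CLUSTERS_PER_PSU = 2
DWELLINGS_PER_CLUSTER = 7

dwellings_per_psu = CLUSTERS_PER_PSU * DWELLINGS_PER_CLUSTER  # 14 per PSU
total_addresses = PSUS_DRAWN * dwellings_per_psu              # both samples together
addresses_per_questionnaire = total_addresses // 2            # one sample per questionnaire


def split_systematically(dwellings):
    """Alternate dwellings between the two subsamples (assumed rule)."""
    return dwellings[0::2], dwellings[1::2]


# Dwelling units 1..14 within a single drawn PSU.
sample_1, sample_2 = split_systematically(list(range(1, dwellings_per_psu + 1)))

# Realised cases reported for the 2005 round.
realised = {"Questionnaire 1": 2497, "Questionnaire 2": 2483}
combined_realised = sum(realised.values())
realisation_rates = {
    name: cases / addresses_per_questionnaire for name, cases in realised.items()
}

print(total_addresses, addresses_per_questionnaire, combined_realised)
print(sample_1, sample_2)
print({name: round(rate, 3) for name, rate in realisation_rates.items()})
```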

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    To accommodate the wide variety of topics that was included in the 2005 survey, two questionnaires were administered simultaneously. Apart from the standard set of demographic and background variables, each version of the questionnaire contained a harmonised core module that remains constant from round to round, with the aim of monitoring change and continuity in a variety of socio-economic and socio-political variables. In addition, a number of themes are accommodated on a rotational basis. This rotating element of the survey consists of two or more topic-specific modules in each round of interviewing and is directed at measuring a range of policy and academic concerns and issues that require more detailed examination at a specific point in time than the multi-topic core module would permit.

    Questions for the core module were asked of both samples (3 500 respondents each, 7 000 in total), of which 5 734 were realised.

    The ISSP module: The International Social Survey Programme (ISSP) is run by a group of research organisations, each of which undertakes to field annually an agreed module of questions on a chosen topic area. SASAS 2003 represents the formalisation of South Africa's inclusion in the ISSP, the intention being to include the module in one of the SASAS questionnaires in each round of interviewing. Each module is chosen for repetition at intervals to allow comparisons both between countries (membership currently stands at 45) and over time. In 2005, the chosen subject was work orientation, and the module was carried in version 2 of the questionnaire (Qs.98-169).

    The standard questionnaires dealt with democracy, identity, public services, social values, crime, voting, demographics, families and family authority.

    The rotating modules in the 2005 survey covered:

    Questionnaire 1: Poverty and social exclusion, family life
    Questionnaire 2: ISSP module (work orientation), soccer World Cup, democracy part 2

