77 datasets found

Data from: Data Fission: Splitting a Single Data Point
tandf.figshare.com
txt
Updated Dec 14, 2023
Cite
James Leiner; Boyan Duan; Larry Wasserman; Aaditya Ramdas (2023). Data Fission: Splitting a Single Data Point [Dataset]. http://doi.org/10.6084/m9.figshare.24328745.v2
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.24328745.v2
Dataset updated
Dec 14, 2023
Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
Authors
James Leiner; Boyan Duan; Larry Wasserman; Aaditya Ramdas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Suppose we observe a random vector X from some distribution in a known family with unknown parameters. We ask the following question: when is it possible to split X into two pieces f(X) and g(X) such that neither part is sufficient to reconstruct X by itself, but both together can recover X fully, and their joint distribution is tractable? One common solution to this problem when multiple samples of X are observed is data splitting, but Rasines and Young offer an alternative approach that uses additive Gaussian noise; this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this article, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on several prototypical applications, such as post-selection inference for trend filtering and other regression problems, and effect size estimation after interactive multiple testing. Supplementary materials for this article are available online.
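The additive-Gaussian-noise split mentioned in this description can be illustrated with a minimal sketch. It is not the authors' code. It assumes X ~ N(mu, sigma^2) with sigma known, draws independent noise Z ~ N(0, sigma^2), and sets f(X) = X + tau*Z and g(X) = X - Z/tau. The function names, the tuning value tau, and the simulation settings are hypothetical choices made here for illustration.

```python
import random

def fission_gaussian(x, sigma, tau=1.0, rng=random):
    """Split one Gaussian draw x into (f, g) using noise z ~ N(0, sigma^2).

    Assumes x ~ N(mu, sigma^2) with known sigma (an assumption of this sketch).
    """
    z = rng.gauss(0.0, sigma)
    return x + tau * z, x - z / tau

def recombine(f, g, tau=1.0):
    """Recover x from both pieces: (f + tau^2 * g) / (1 + tau^2)."""
    return (f + tau ** 2 * g) / (1.0 + tau ** 2)

def sample_correlation(a, b):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)          # arbitrary seed for reproducibility
mu, sigma = 2.0, 1.5            # made-up parameters
xs = [rng.gauss(mu, sigma) for _ in range(20_000)]
pieces = [fission_gaussian(x, sigma, tau=1.0, rng=rng) for x in xs]
fs = [f for f, _ in pieces]
gs = [g for _, g in pieces]

# Both pieces together recover x up to floating-point error.
max_error = max(abs(recombine(f, g) - x) for (f, g), x in zip(pieces, xs))
# The two pieces are uncorrelated (and, being jointly Gaussian, independent).
corr_fg = sample_correlation(fs, gs)
```

In this sketch neither piece alone determines x, because each carries the unknown noise z, yet the pair reconstructs it exactly. Larger tau puts more noise into f and less into g.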
2023 Census totals by topic for families and extended families by...
datafinder.stats.govt.nz
csv, dwg, geodatabase +6
Updated Nov 24, 2024
Cite
Stats NZ (2024). 2023 Census totals by topic for families and extended families by statistical area 2 [Dataset]. https://datafinder.stats.govt.nz/layer/120891-2023-census-totals-by-topic-for-families-and-extended-families-by-statistical-area-2/
Available download formats: mapinfo tab, geopackage / sqlite, shapefile, kml, csv, geodatabase, pdf, mapinfo mif, dwg
Dataset updated
Nov 24, 2024
Dataset provided by
Statistics New Zealand (http://www.stats.govt.nz/)
Authors
Stats NZ
License
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Description
Dataset contains counts and measures for families and extended families from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.

The variables included in this dataset are for families and extended families in households in occupied private dwellings:

Count of families

Family type

Number of people in family

Average number of people in family

Total family income

Median ($) total family income

Count of extended families

Extended family type

Total extended family income

Median ($) total extended family income.

Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.

Footnotes

Geographical boundaries

Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.

Caution using time series

Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).

About the 2023 Census dataset

For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. Where we were confident that people who hadn’t completed a census form should be counted, we added real data about those real people to the dataset (this is known as admin enumeration). We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.

Data quality

The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.

Concept descriptions and quality ratings

Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.

Using data for good

Stats NZ expects that people working with census data do so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga".

Confidentiality

The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
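Random rounding to base 3 can be pictured with a short sketch. It is not Stats NZ's implementation. It assumes the common convention that a count already divisible by 3 is left unchanged, while any other count moves to the nearer multiple of 3 with probability 2/3 and to the farther one with probability 1/3. The "fixed" behaviour (the same cell always rounds the same way) is imitated here by deriving the draw from a hypothetical `cell_key`.

```python
import hashlib

def frr3(count, cell_key):
    """Round a non-negative count to a multiple of 3, deterministically per cell.

    cell_key is a hypothetical identifier for the table cell; the hash-based
    draw is an assumption of this sketch, not a documented Stats NZ method.
    """
    remainder = count % 3
    if remainder == 0:
        return count                      # already a multiple of 3
    lower = count - remainder
    upper = lower + 3
    # remainder 1 is nearer to lower; remainder 2 is nearer to upper
    nearer, farther = (lower, upper) if remainder == 1 else (upper, lower)
    digest = hashlib.sha256(f"{cell_key}:{count}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2 ** 64   # pseudo-uniform in [0, 1)
    return nearer if u < 2 / 3 else farther

# Made-up counts for three cells of one table.
raw_counts = {"area_A": 10, "area_B": 11, "area_C": 12}
rounded = {key: frr3(value, key) for key, value in raw_counts.items()}
```

Because each cell is rounded on its own, the rounded figures need not add up to a rounded total, which is why the description warns that individual figures may not always sum to stated totals.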

Measures

Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.

Percentages

To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.

Symbol

-997 Not available

-999 Confidential
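The percentage rule and the placeholder symbols above can be combined in a small sketch. The function names and the example figures are hypothetical; only the codes -997 and -999 and the "divide by Total stated" rule come from the description.

```python
NOT_AVAILABLE = -997   # symbol: not available
CONFIDENTIAL = -999    # symbol: confidential (suppressed)

def clean(value):
    """Return None for a placeholder code, otherwise the value unchanged."""
    return None if value in (NOT_AVAILABLE, CONFIDENTIAL) else value

def percentage(category_count, total_stated):
    """Percentage of 'Total stated', or None when either figure is unusable."""
    count, total = clean(category_count), clean(total_stated)
    if count is None or total is None or total == 0:
        return None
    return 100.0 * count / total

# Made-up figures for illustration only.
share = percentage(27, 108)             # 25.0
suppressed = percentage(-999, 108)      # None, because the count is confidential
```

Treating the placeholder codes as missing before any arithmetic avoids negative "counts" leaking into sums and percentages.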

Inconsistencies in definitions

Please note that there may be differences in definitions between census classifications and those used for other data collections.
Living Standards Survey, Wave 3 (extension), 2007-2008 - Timor-Leste
microdata.fao.org
Updated Nov 8, 2022
Cite
National Statistics Directorate (2022). Living Standards Survey, Wave 3 (extension), 2007-2008 - Timor-Leste [Dataset]. https://microdata.fao.org/index.php/catalog/1507
Dataset updated
Nov 8, 2022
Dataset authored and provided by
National Statistics Directorate
Time period covered
2007 - 2008
Area covered
Timor-Leste
Description
Abstract

In 2007-2008 a multi-topic household survey, the Timor Leste Living Standards Survey (LSS-2), was conducted in East Timor with the main objectives of developing a system of poverty monitoring, supporting poverty reduction, and monitoring human development indicators and progress toward the Millennium Development Goals. The LSS-3 extension survey was designed to re-visit one third of the households interviewed under the LSS-2 to explore different facets of household welfare and behaviour in the country, while also being able to make use of information collected in the LSS-2 survey for analytic purposes. The four new topics investigated in the extension survey are:

Risk and Vulnerability: This section is designed to help us understand the dimensions and sources of household-level vulnerability to uninsured risks in Timor Leste, and the efficacy and welfare effects of various risk-management strategies (prevention, mitigation, coping) and mechanisms (private as well as public, formal as well as informal) households do (or do not) have access to. The work in Timor Leste is part of a program of analytic work and policy dialogue throughout the EAP region, more information on which can be found on the World Bank website.

Land Degradation and Poverty: This section of the questionnaire is designed to identify proximate causes of deforestation through land use patterns and links with poverty; understand strengths and failures of common land resource management institutions (property rights, enforcement); understand the impact of the Siam Weed problem on household welfare.

Justice for Poor: The Justice for the Poor/Access to Justice (J4P/A2J) module of the survey will serve mainly as an initial diagnostic for project development in the country. The topics we would be interested in covering would be Dispute Processing/Resolution; Social Legal Norms and Perceptions of Efficiency in Government (Local, Sub-District, District and National level).

Access to Financial Services: The financial service work has the following two objectives: (i) to collect data on access to and use of financial services (savings and credit), both formal and informal, and (ii) to assess the quality of information on access to financial services obtained from heads of households vs. from all adults, i.e. whether a bias is introduced by not asking all household members and whether the characteristics of the head or the household affect this (gender, age, nuclear family, urban, education levels, wealth, etc.).

Geographic coverage

National coverage

Analysis unit

Households

Kind of data

Sample survey data [ssd]

Sampling procedure

SAMPLE DESIGN FOR THE 2008 EXTENSION SURVEY

The sample for the LSS-3 Extension survey was a sub-sample of the original LSS-2 sample. The LSS-2 field work was divided into 52 "weeks", with each week being a random subset of the total sample. The sub-sample was chosen by randomly selecting 19 weeks from the original field work schedule. Each week contained seven Primary Sampling Units (PSUs) for a total of 133 PSUs. In each PSU the teams were to interview 12 of the original 15 households, with the remaining three to serve as replacements. The total nominal sample size was thus 1596.
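The sample-size arithmetic in this paragraph can be checked with a few lines. The constants come from the text; the random seed and the variable names are arbitrary choices for this sketch, and the drawn weeks are not the weeks actually selected by the survey team.

```python
import random

TOTAL_WEEKS = 52            # LSS-2 field work "weeks"
WEEKS_SELECTED = 19         # weeks drawn for the extension survey
PSUS_PER_WEEK = 7           # Primary Sampling Units per week
INTERVIEWS_PER_PSU = 12     # households to interview per PSU
REPLACEMENTS_PER_PSU = 3    # remaining households held as replacements

rng = random.Random(2008)   # arbitrary seed, for a reproducible illustration
selected_weeks = sorted(rng.sample(range(1, TOTAL_WEEKS + 1), WEEKS_SELECTED))

total_psus = WEEKS_SELECTED * PSUS_PER_WEEK                 # 19 * 7 = 133
nominal_sample_size = total_psus * INTERVIEWS_PER_PSU       # 133 * 12 = 1596
households_per_psu = INTERVIEWS_PER_PSU + REPLACEMENTS_PER_PSU  # 15
```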

Additional interviews: Following the collection and initial analysis of the data, it was determined that data from one district, Manatuto, and partially from another district, Oecussi, were of insufficient quality in certain modules. Therefore, it was decided to repeat the survey in another 25 PSUs of these two districts - six in Manatuto, and 19 in Oecussi. The additional PSUs chosen were randomly selected within the two districts from the remaining non-panel PSUs in the original LSS-2 sample.

Mode of data collection

Face-to-face [f2f]

Cleaning operations

DATA CLEANING

The LSS-3 had a significant number of responses in which the response is "other". In general, if the response clearly fit into a pre-coded response category, it was recoded into that category during the cleaning and compilation process. Some responses where additional information was provided were not recoded even though they clearly fit into pre-coded categories. For example, "agriculture project" would be recoded into the "agriculture" category, while "community garden" would not. Data users can either use the additional information, or re-code into categories as they see fit.

Data appraisal

Potential Data Quality Issues in 2008 Extension survey

Agriculture: As with the individual roster of the previous section, the plots listed in the previous survey are listed on the pre-printed cover page and all changes noted. The agricultural section, like the other sections, suffers from problems with open-ended questions. This is particularly the case for the question asking what community restrictions are placed on the clearing of forest land (section 2d). The translation from the original question was vague (using the Tetun word for "boundary" for "restriction") and therefore many of the responses relate to physical boundaries on the land, such as stone walls and tree lines. Additionally, the translation of all answers from Tetun into English is imperfect, and those wishing to use this information for analytical purposes are advised to also refer to the original Tetun. Analysts should be careful in using the data from the open-ended questions because of translation problems. Also, it was noted during the training and field work that many interviewers had significant difficulties understanding definitions in some of the land management and investment questions. In general, however, all agricultural data may be used for analysis, with sampling weights w3.

Finance: It should be noted that the quality of the data for the finance experiment (comparing the knowledge of the household head to that of other household members) was not sufficient for the experiment to be deemed a success. Subsequent spot-checking revealed that in many cases, interviewers asked the household head about the financial activities of various household members instead of asking them directly. Therefore, this data should only be used to measure the access to finance at the household level. The finance sections were not repeated during the additional interviews in the replacement PSUs. Sampling weights w1 should be used when doing any analysis with this data.

Shocks and Vulnerability: It was determined following the initial round of data collection that the shocks and vulnerability module had some issues with uneven interview quality. Two reasons were listed as potential causes of the data quality issues: (1) fundamental inability to adequately translate both the word and concept of a "shock" into the Timorese context, and (2) incomplete / questionable responses to the health shock questions in particular. Analysis for health shocks should drop the "questionable" households and use the "re-interview" households, with sampling weights w2.

Justice for the Poor: Similar to the shocks and vulnerability module, the justice module included a long series of follow-up questions if the household indicated having experienced a dispute during the recall period. Again, the number of disputes experienced by the household seemed extremely low compared to expectations. This was particularly a problem with the Manatuto district, in which no disputes were recorded during the first set of TLSLS2-X interviews. Analysis for the disputes section of the justice module should drop the "questionable" households and use the "re-interview" households, with sampling weights w2. The justice module also has a number of instances in which the specifications for "other" were not recorded. Every effort was made to ensure this data was as complete as possible, but gaps do remain. Also, data users should use caution when using the imputed rank variable in section 5D. The rank in terms of importance was not explicitly captured in the data entry software, and the rankings therefore had to be imputed from the order they were listed in the original data entry. Inconsistencies may exist in this variable.
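The weight guidance in the data appraisal notes (w3 for agriculture, w1 for finance, w2 for health shocks and disputes after dropping "questionable" households) can be sketched as a weighted mean. The record layout, the `questionable` flag, and the variable name `had_health_shock` are hypothetical; the actual file and variable names in the survey data may differ.

```python
# Module-to-weight mapping taken from the data appraisal notes.
MODULE_WEIGHT = {"agriculture": "w3", "finance": "w1", "shocks": "w2", "justice": "w2"}
# Modules where the notes advise dropping "questionable" households.
DROP_QUESTIONABLE = {"shocks", "justice"}

def weighted_mean(households, variable, module):
    """Weighted mean of `variable` using the weight advised for `module`."""
    weight_name = MODULE_WEIGHT[module]
    rows = [
        h for h in households
        if not (module in DROP_QUESTIONABLE and h.get("questionable", False))
    ]
    total_weight = sum(h[weight_name] for h in rows)
    if total_weight == 0:
        raise ValueError("no usable households for this module")
    return sum(h[weight_name] * h[variable] for h in rows) / total_weight

# Made-up records for illustration only.
households = [
    {"had_health_shock": 1, "w2": 2.0, "questionable": False},
    {"had_health_shock": 0, "w2": 6.0, "questionable": False},
    {"had_health_shock": 1, "w2": 5.0, "questionable": True},   # dropped
]
shock_rate = weighted_mean(households, "had_health_shock", "shocks")  # 2 / 8 = 0.25
```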
Replication Data for: "A Topic-based Segmentation Model for Identifying...
search.dataone.org
Updated Sep 25, 2024
Cite
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2
Unique identifier
https://doi.org/10.7910/DVN/EE3DE2
Dataset updated
Sep 25, 2024
Dataset provided by
Harvard Dataverse
Authors
Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
Description
We provide instructions, code, and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package that lets researchers and practitioners apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their own datasets.

First, we provide R code to replicate the illustrative simulation study: see file 1.

Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R.

Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note: due to Yelp's dataset terms of use and the restriction on data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6).

Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5.

Please see more details in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run from the beginning to the end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing DV and IVs matrix for customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing DV and IVs matrix for restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating restaurant-level segmentation study with Yelp. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running Benchmark models in Table 6]

Unsupervised Topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with 'ldatuning' R package), run 'LDA' function in the 'topicmodels' package. Then, compute topic probabilities per restaurant (with 'posterior' function in the package), which can be used as predictors. Then, conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: 'lm' default function in R.

Latent class regression without variable selection: 'flexmix' function in 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of dependent variable per each segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, we can do prediction of dependent variables per each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer...
Top 10 words of two topics with highest absolute values of regression...
plos.figshare.com
xls
Updated Jun 19, 2023
Cite
Weiran Xu; Koji Eguchi (2023). Top 10 words of two topics with highest absolute values of regression coefficients and the topic coherence measured in NPMI on the training set when K = 15. [Dataset]. http://doi.org/10.1371/journal.pone.0277104.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0277104.t005
Dataset updated
Jun 19, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Weiran Xu; Koji Eguchi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Top 10 words of two topics with highest absolute values of regression coefficients and the topic coherence measured in NPMI on the training set when K = 15.
2023 Census totals by topic for individuals by statistical area 2 – part 1
datafinder.stats.govt.nz
csv, dwg, geodatabase +6
Updated Nov 25, 2024
Cite
Stats NZ (2024). 2023 Census totals by topic for individuals by statistical area 2 – part 1 [Dataset]. https://datafinder.stats.govt.nz/layer/120897-2023-census-totals-by-topic-for-individuals-by-statistical-area-2-part-1/
Available download formats: mapinfo tab, mapinfo mif, csv, dwg, pdf, geodatabase, shapefile, kml, geopackage / sqlite
Dataset updated
Nov 25, 2024
Dataset provided by
Statistics New Zealand (http://www.stats.govt.nz/)
Authors
Stats NZ
License
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Description
Dataset contains counts and measures for individuals from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.

The variables included in this dataset are for the census usually resident population count (unless otherwise stated). All data is for level 1 of the classification (unless otherwise stated).

The variables for part 1 of the dataset are:

Census usually resident population count

Census night population count

Age (5-year groups)

Age (life cycle groups)

Median age

Birthplace (NZ born/overseas born)

Birthplace (broad geographic areas)

Ethnicity (total responses) for level 1 and ‘Other Ethnicity’ grouped by ‘New Zealander’ and ‘Other Ethnicity nec’

Māori descent indicator

Languages spoken (total responses)

Official language indicator

Gender

Cisgender and transgender status – census usually resident population count aged 15 years and over

Sex at birth

Rainbow/LGBTIQ+ indicator for the census usually resident population count aged 15 years and over

Sexual identity for the census usually resident population count aged 15 years and over

Legally registered relationship status for the census usually resident population count aged 15 years and over

Partnership status in current relationship for the census usually resident population count aged 15 years and over

Number of children born for the sex at birth female census usually resident population count aged 15 years and over

Average number of children born for the sex at birth female census usually resident population count aged 15 years and over

Religious affiliation (total responses)

Cigarette smoking behaviour for the census usually resident population count aged 15 years and over

Disability indicator for the census usually resident population count aged 5 years and over

Difficulty communicating for the census usually resident population count aged 5 years and over

Difficulty hearing for the census usually resident population count aged 5 years and over

Difficulty remembering or concentrating for the census usually resident population count aged 5 years and over

Difficulty seeing for the census usually resident population count aged 5 years and over

Difficulty walking for the census usually resident population count aged 5 years and over

Difficulty washing for the census usually resident population count aged 5 years and over.

Download lookup file for part 1 from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.

Footnotes

Te Whata

Under the Mana Ōrite Relationship Agreement, Te Kāhui Raraunga (TKR) will be publishing Māori descent and iwi affiliation data from the 2023 Census in partnership with Stats NZ. This will be available on Te Whata, a TKR platform.

Geographical boundaries

Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.

Subnational census usually resident population

The census usually resident population count of an area (subnational count) is a count of all people who usually live in that area and were present in New Zealand on census night. It excludes visitors from overseas, visitors from elsewhere in New Zealand, and residents temporarily overseas on census night. For example, a person who usually lives in Christchurch city and is visiting Wellington city on census night will be included in the census usually resident population count of Christchurch city.

Population counts

Stats NZ publishes a number of different population counts, each using a different definition and methodology. Population statistics – user guide has more information about different counts.

Caution using time series

Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).

Study participation time series

In the 2013 Census study participation was only collected for the census usually resident population count aged 15 years and over.

About the 2023 Census dataset

For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. Where we were confident that people who hadn’t completed a census form should be counted, we added real data about those real people to the dataset (this is known as admin enumeration). We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.

Data quality

The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.

Concept descriptions and quality ratings

Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.

Disability indicator

This data should not be used as an official measure of disability prevalence. Disability prevalence estimates are only available from the 2023 Household Disability Survey. Household Disability Survey 2023: Final content has more information about the survey.

Activity limitations are measured using the Washington Group Short Set (WGSS). The WGSS asks about six basic activities that a person might have difficulty with: seeing, hearing, walking or climbing stairs, remembering or concentrating, washing all over or dressing, and communicating. A person was classified as disabled in the 2023 Census if there was at least one of these activities that they had a lot of difficulty with or could not do at all.
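The classification rule described here can be written as a short sketch. The domain keys and the response labels are assumptions chosen to mirror the wording above; the coding used in the actual census data may differ.

```python
# The six Washington Group Short Set activities named in the description.
WGSS_DOMAINS = (
    "seeing",
    "hearing",
    "walking_or_climbing_stairs",
    "remembering_or_concentrating",
    "washing_or_dressing",
    "communicating",
)
# Response levels that trigger the indicator (labels are assumed for this sketch).
SEVERE_RESPONSES = {"a lot of difficulty", "cannot do at all"}

def is_disabled(responses):
    """True if at least one activity has a lot of difficulty or cannot be done."""
    return any(responses.get(domain) in SEVERE_RESPONSES for domain in WGSS_DOMAINS)

# Made-up respondents for illustration only.
person_a = {domain: "no difficulty" for domain in WGSS_DOMAINS}
person_b = dict(person_a, hearing="a lot of difficulty")
person_c = dict(person_a, seeing="some difficulty")

flags = [is_disabled(p) for p in (person_a, person_b, person_c)]
```

Under this rule "some difficulty" alone does not trigger the indicator, so only the second respondent is flagged.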

Using data for good

Stats NZ expects that people working with census data do so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga".

Confidentiality

The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.

Measures

Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.

Percentages

To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.

Symbol

-997 Not available

-999 Confidential

Inconsistencies in definitions

Please note that there may be differences in definitions between census classifications and those used for other data collections.
2023 Census totals by topic for individuals by statistical area 2 – part 2
datafinder.stats.govt.nz
csv, dwg, geodatabase +6
Updated Nov 25, 2024
Cite
Stats NZ (2024). 2023 Census totals by topic for individuals by statistical area 2 – part 2 [Dataset]. https://datafinder.stats.govt.nz/layer/120898-2023-census-totals-by-topic-for-individuals-by-statistical-area-2-part-2/
Available download formats
dwg, mapinfo tab, pdf, mapinfo mif, geodatabase, shapefile, kml, geopackage / sqlite, csv
Dataset updated
Nov 25, 2024
Dataset provided by
Statistics New Zealand (http://www.stats.govt.nz/)
Authors
Stats NZ
License
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Description
Dataset contains counts and measures for individuals from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.

The variables included in this dataset are for the census usually resident population count (unless otherwise stated). All data is for level 1 of the classification.

The variables for part 2 of the dataset are:

Individual home ownership for the census usually resident population count aged 15 years and over

Usual residence 1 year ago indicator

Usual residence 5 years ago indicator

Years at usual residence

Average years at usual residence

Years since arrival in New Zealand for the overseas-born census usually resident population count

Average years since arrival in New Zealand for the overseas-born census usually resident population count

Study participation

Main means of travel to education, by usual residence address for the census usually resident population who are studying

Main means of travel to education, by education address for the census usually resident population who are studying

Highest qualification for the census usually resident population count aged 15 years and over

Post-school qualification in New Zealand indicator for the census usually resident population count aged 15 years and over

Highest secondary school qualification for the census usually resident population count aged 15 years and over

Post-school qualification level of attainment for the census usually resident population count aged 15 years and over

Sources of personal income (total responses) for the census usually resident population count aged 15 years and over

Total personal income for the census usually resident population count aged 15 years and over

Median ($) total personal income for the census usually resident population count aged 15 years and over

Work and labour force status for the census usually resident population count aged 15 years and over

Job search methods (total responses) for the unemployed census usually resident population count aged 15 years and over

Status in employment for the employed census usually resident population count aged 15 years and over

Unpaid activities (total responses) for the census usually resident population count aged 15 years and over

Hours worked in employment per week for the employed census usually resident population count aged 15 years and over

Average hours worked in employment per week for the employed census usually resident population count aged 15 years and over

Industry, by usual residence address for the employed census usually resident population count aged 15 years and over

Industry, by workplace address for the employed census usually resident population count aged 15 years and over

Occupation, by usual residence address for the employed census usually resident population count aged 15 years and over

Occupation, by workplace address for the employed census usually resident population count aged 15 years and over

Main means of travel to work, by usual residence address for the employed census usually resident population count aged 15 years and over

Main means of travel to work, by workplace address for the employed census usually resident population count aged 15 years and over

Sector of ownership for the employed census usually resident population count aged 15 years and over

Individual unit data source.

Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.

Footnotes

Te Whata

Under the Mana Ōrite Relationship Agreement, Te Kāhui Raraunga (TKR) will be publishing Māori descent and iwi affiliation data from the 2023 Census in partnership with Stats NZ. This will be available on Te Whata, a TKR platform.

Geographical boundaries

Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.

Subnational census usually resident population

The census usually resident population count of an area (subnational count) is a count of all people who usually live in that area and were present in New Zealand on census night. It excludes visitors from overseas, visitors from elsewhere in New Zealand, and residents temporarily overseas on census night. For example, a person who usually lives in Christchurch city and is visiting Wellington city on census night will be included in the census usually resident population count of Christchurch city.

Population counts

Stats NZ publishes a number of different population counts, each using a different definition and methodology. Population statistics – user guide has more information about different counts.

Caution using time series

Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).

Study participation time series

In the 2013 Census study participation was only collected for the census usually resident population count aged 15 years and over.

About the 2023 Census dataset

For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.

Data quality

The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.

Concept descriptions and quality ratings

Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.

Disability indicator

This data should not be used as an official measure of disability prevalence. Disability prevalence estimates are only available from the 2023 Household Disability Survey. Household Disability Survey 2023: Final content has more information about the survey.

Activity limitations are measured using the Washington Group Short Set (WGSS). The WGSS asks about six basic activities that a person might have difficulty with: seeing, hearing, walking or climbing stairs, remembering or concentrating, washing all over or dressing, and communicating. A person was classified as disabled in the 2023 Census if there was at least one of these activities that they had a lot of difficulty with or could not do at all.

Using data for good

Stats NZ expects census data to be used with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga".

Confidentiality

The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.

Measures

Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures
Top 10 words of two topics with highest absolute values of regression...
plos.figshare.com
xls
Updated Jun 21, 2023
Cite
Weiran Xu; Koji Eguchi (2023). Top 10 words of two topics with highest absolute values of regression coefficients. [Dataset]. http://doi.org/10.1371/journal.pone.0277104.t003
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0277104.t003
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Weiran Xu; Koji Eguchi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Top 10 words of two topics with highest absolute values of regression coefficients.
Baseline multiple linear regression model with end fitness as the response...
plos.figshare.com
xls
Updated Jun 5, 2023
Cite
Dan Weaving; Ben Jones; Matt Ireton; Sarah Whitehead; Kevin Till; Clive B. Beggs (2023). Baseline multiple linear regression model with end fitness as the response variable, showing the calculated variable inflation factors (VIFs). [Dataset]. http://doi.org/10.1371/journal.pone.0211776.t004
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0211776.t004
Dataset updated
Jun 5, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Dan Weaving; Ben Jones; Matt Ireton; Sarah Whitehead; Kevin Till; Clive B. Beggs
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Baseline multiple linear regression model with end fitness as the response variable, showing the calculated variable inflation factors (VIFs).
Inability to keep home adequately warm by NUTS 2 region
ec.europa.eu
Updated Mar 22, 2025
Cite
Eurostat (2025). Inability to keep home adequately warm by NUTS 2 region [Dataset]. http://doi.org/10.2908/ILC_MDES01_R
Available download formats
application/vnd.sdmx.genericdata+xml;version=2.1, application/vnd.sdmx.data+csv;version=2.0.0, application/vnd.sdmx.data+csv;version=1.0.0, tsv, json, application/vnd.sdmx.data+xml;version=3.0.0
Unique identifier
https://doi.org/10.2908/ILC_MDES01_R
Dataset updated
Mar 22, 2025
Dataset authored and provided by
Eurostat (https://ec.europa.eu/eurostat)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2021 - 2024
Area covered
Molise, Helsinki-Uusimaa, Stockholm, Leipzig, Principado de Asturias, Zealand, Makroregion centralny, Zachodniopomorskie, Thüringen, Budapest
Description
The European Union Statistics on Income and Living Conditions (EU-SILC) collects timely and comparable multidimensional microdata on income, poverty, social exclusion and living conditions.

The EU-SILC collection is a key instrument for providing information required by the European Semester [1] and the European Pillar of Social Rights, and the main source of data for microsimulation purposes and flash estimates of income distribution and poverty rates.

AROPE remains crucial to monitor European social policies, especially to monitor the EU 2030 target on poverty and social exclusion. For more information, please consult EU social indicators.

The EU-SILC instrument provides two types of data:

Cross-sectional data pertaining to a given time or a certain time period with variables on income, poverty, social exclusion and other living conditions.

Longitudinal data pertaining to individual-level changes over time, observed periodically over a rotation scheme of four or more years (Annex III (2) of 2019/1700).

EU-SILC collects:

annual variables,

three-yearly modules,

six-yearly modules,

ad-hoc new policy needs modules,

optional variables.

The variables collected are grouped by topic and detailed topic and transmitted to Eurostat in four main files (D-File, H-File, R-File and P-file).

The domain ‘Income and Living Conditions’ covers the following topics: persons at risk of poverty or social exclusion, income inequality, income distribution and monetary poverty, living conditions, material deprivation, and EU-SILC ad-hoc modules, which are structured into collections of indicators on specific topics.

In 2023, in addition to annual data, EU-SILC collected the three-yearly module on labour market and housing, the six-yearly module on intergenerational transmission of advantages and disadvantages, housing difficulties, and the ad hoc subject on households' energy efficiency.

From 2021 onwards, the EU quality reports use the Single Integrated Metadata Structure (SIMS).

[1] The European Semester is the European Union’s framework for the coordination and surveillance of economic and social policies.
Descriptive statistics of the variables.
plos.figshare.com
xls
Updated Jan 31, 2025
Cite
Baodong Chen; Jing Li; Jiayi Zhang (2025). Descriptive statistics of the variables. [Dataset]. http://doi.org/10.1371/journal.pone.0317892.t002
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317892.t002
Dataset updated
Jan 31, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Baodong Chen; Jing Li; Jiayi Zhang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Corporate financialization is a growing concern in China, and its impact on the main business of real enterprises is a crucial topic. This paper uses data from all A-share non-financial listed companies in China between 2013 and 2022 to establish a dynamic panel threshold model and test the effect of corporate financialization on enterprise performance. The empirical results indicate a threshold effect between the two variables: corporate financialization has both positive and negative effects on main business performance, with a threshold of 5.82%. Additionally, significant heterogeneous results are found for the nature of ownership, asset maturity, industry and regional distribution.
MSE and sample standard deviation on test set of movie rating score...
plos.figshare.com
xls
Updated Jun 21, 2023
Cite
Weiran Xu; Koji Eguchi (2023). MSE and sample standard deviation on test set of movie rating score prediction when K = 15. [Dataset]. http://doi.org/10.1371/journal.pone.0277104.t006
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0277104.t006
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Weiran Xu; Koji Eguchi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MSE and sample standard deviation on test set of movie rating score prediction when K = 15.
Michigan Public Policy Survey, 2009 - 2016
icpsr.umich.edu
Updated Feb 28, 2024
Cite
University of Michigan. Center for Local, State, and Urban Policy (2024). Michigan Public Policy Survey, 2009 - 2016 [Dataset]. http://doi.org/10.3886/ICPSR39057.v1
Unique identifier
https://doi.org/10.3886/ICPSR39057.v1
Dataset updated
Feb 28, 2024
Dataset provided by
Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
Authors
University of Michigan. Center for Local, State, and Urban Policy
License
https://www.icpsr.umich.edu/web/ICPSR/studies/39057/terms
Time period covered
Apr 23, 2009 - Dec 13, 2016
Area covered
Michigan
Description
The Michigan Public Policy Survey (MPPS) is a program of state-wide surveys of local government leaders in Michigan. The MPPS is designed to fill an important information gap in the policymaking process. While there are ongoing surveys of the business community and of the citizens of Michigan, before the MPPS there were no ongoing surveys of local government officials that were representative of all general purpose local governments in the state. Therefore, while we knew the policy priorities and views of the state's businesses and citizens, we knew very little about the views of the local officials who are so important to the economies and community life throughout Michigan. The MPPS was launched in 2009 by the Center for Local, State, and Urban Policy (CLOSUP) at the University of Michigan and is conducted in partnership with the Michigan Association of Counties, Michigan Municipal League, and Michigan Townships Association. The associations provide CLOSUP with contact information for the survey's respondents, and consult on survey topics. CLOSUP makes all decisions on survey design, data analysis, and reporting, and receives no funding support from the associations. The surveys investigate local officials' opinions and perspectives on a variety of important public policy issues and solicit factual information about their localities relevant to policymaking. Over time, the program has covered issues such as fiscal, budgetary and operational policy, fiscal health, public sector compensation, workforce development, local-state governmental relations, intergovernmental collaboration, economic development strategies and initiatives such as placemaking and economic gardening, the role of local government in environmental sustainability, energy topics such as hydraulic fracturing ("fracking") and wind power, trust in government, views on state policymaker performance, opinions on the impacts of the Federal Stimulus Program (ARRA), and more. 
The program will investigate many other issues relevant to local and state policy in the future. A searchable database of every question the MPPS has asked is available on CLOSUP's website. Results of MPPS surveys are currently available as reports, and via online data tables. The MPPS datasets are being released in two forms: public-use datasets and restricted-use datasets. Unlike the public-use datasets, the restricted-use datasets represent full MPPS survey waves, and include all of the survey questions from a wave. Restricted-use datasets also allow for multiple waves to be linked together for longitudinal analysis. The MPPS staff do still modify these restricted-use datasets to remove jurisdiction and respondent identifiers and to recode other variables in order to protect confidentiality. However, it is theoretically possible that a researcher might be able, in some rare cases, to use enough variables from a full dataset to identify a unique jurisdiction, so access to these datasets is restricted and approved on a case-by-case basis. CLOSUP encourages researchers interested in the MPPS to review the codebooks included in this data collection to see the full list of variables including those not found in the public-use datasets, and to explore the MPPS data using the public-use-datasets. The codebooks for these restricted use datasets are available for download on CLOSUP's website.
Inability to face unexpected financial expenses by NUTS 2 region
ec.europa.eu
Updated Oct 10, 2025
Cite
Eurostat (2025). Inability to face unexpected financial expenses by NUTS 2 region [Dataset]. http://doi.org/10.2908/ILC_MDES04_R
Available download formats
json, application/vnd.sdmx.data+csv;version=2.0.0, application/vnd.sdmx.data+xml;version=3.0.0, tsv, application/vnd.sdmx.data+csv;version=1.0.0, application/vnd.sdmx.genericdata+xml;version=2.1
Unique identifier
https://doi.org/10.2908/ILC_MDES04_R
Dataset updated
Oct 10, 2025
Dataset authored and provided by
Eurostat (https://ec.europa.eu/eurostat)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2021 - 2024
Area covered
Voreia Elláda, Koblenz, Zuid-Nederland, Åland, Länsi-Suomi, Emilia-Romagna, Spain, Makroregion północno-zachodni, Pays de la Loire, Noroeste, Severozápad
Description
The European Union Statistics on Income and Living Conditions (EU-SILC) collects timely and comparable multidimensional microdata on income, poverty, social exclusion and living conditions.

The EU-SILC collection is a key instrument for providing information required by the European Semester [1] and the European Pillar of Social Rights, and the main source of data for microsimulation purposes and flash estimates of income distribution and poverty rates.

AROPE remains crucial to monitor European social policies, especially to monitor the EU 2030 target on poverty and social exclusion. For more information, please consult EU social indicators.

The EU-SILC instrument provides two types of data:

Cross-sectional data pertaining to a given time or a certain time period with variables on income, poverty, social exclusion and other living conditions.

Longitudinal data pertaining to individual-level changes over time, observed periodically over a rotation scheme of four or more years (Annex III (2) of 2019/1700).

EU-SILC collects:

annual variables,

three-yearly modules,

six-yearly modules,

ad-hoc new policy needs modules,

optional variables.

The variables collected are grouped by topic and detailed topic and transmitted to Eurostat in four main files (D-File, H-File, R-File and P-file).

The domain ‘Income and Living Conditions’ covers the following topics: persons at risk of poverty or social exclusion, income inequality, income distribution and monetary poverty, living conditions, material deprivation, and EU-SILC ad-hoc modules, which are structured into collections of indicators on specific topics.

In 2023, in addition to annual data, EU-SILC collected the three-yearly module on labour market and housing, the six-yearly module on intergenerational transmission of advantages and disadvantages, housing difficulties, and the ad hoc subject on households' energy efficiency.

From 2021 onwards, the EU quality reports use the Single Integrated Metadata Structure (SIMS).

[1] The European Semester is the European Union’s framework for the coordination and surveillance of economic and social policies.
South African Social Attitudes Survey (SASAS) 2006: Combined data with...
demo-b2find.dkrz.de
Updated Sep 20, 2025
Cite
(2025). South African Social Attitudes Survey (SASAS) 2006: Combined data with household weight - All provinces - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/d65e8780-c31c-5683-9a82-3d00c5e17fb1
Dataset updated
Sep 20, 2025
Area covered
South Africa
Description
Description: The harmonised core module data are available in the combined dataset. The questions contained in the core modules of the two SASAS questionnaires for 2006 (demographics and core thematic issues) were asked of 7000 respondents, while the remaining rotating modules were asked of a half sample of approximately 3500 respondents each. The combined data set contains 5843 records and 157 variables. Topics included in the questionnaires are: democracy, identity, public services, moral issues, crime, voting, demographics and other classificatory variables. This version of the combined dataset should be used where analysis is to be performed at household level. Abstract: The primary objective of the South African Social Attitudes Survey (SASAS) is to design, develop and implement a conceptually and methodologically robust study of changing social attitudes and values in South Africa. In meeting this objective, the HSRC is carefully and consistently monitoring and providing insight into changes in attitudes among various socio-demographic groupings. SASAS is intended to provide a unique long-term account of the social fabric of modern South Africa, and of how its changing political and institutional structures interact over time with changing social attitudes and values. The survey has been designed to yield a national representative sample of adults aged 16 and older, using the Human Sciences Research Council's (HSRC) Master Sample, which was designed in 2002 and consists of 1000 primary sampling units (PSUs). These PSUs were drawn, with probability proportional to size from a pre-census 2001 list of 80780 enumerator areas (EAs). As the basis of the 2006 SASAS round of interviewing, a sub-sample of 500 EAs (PSUs) was drawn from the master sample. Three explicit stratification variables were used, namely province, geographic type and majority population group. The survey is conducted annually and the 2006 survey is the fourth wave in the series. 
To accommodate the wide variety of topics included in the survey, two questionnaires are administered simultaneously. Apart from the standard set of demographic and background variables, each version of the questionnaire contained a harmonised core module. The questions contained in the core modules of the two SASAS questionnaires (demographics and core thematic issues) were asked of 7000 respondents, while the remaining rotating modules were asked of a half sample of approximately 3500 respondents each. The core module remains constant with the aim of monitoring change and continuity in a variety of socio-economic and socio-political variables. In addition, a number of themes are accommodated in rotation. The rotating element of the survey consists of two or more topic-specific modules in each round of interviewing and is directed at measuring a range of policy and academic concerns and issues that require more detailed examination at a specific point in time than the multi-topic core module would permit. Topics included in the questionnaires are: democracy, national identity, public services, moral issues, crime, voting, demographics and other classificatory variables. Rotating modules are: media and communication, health status and behavior, social exclusion, tourism and leisure, intergroup relations, Soccer World Cup, work and welfare, social exclusion, democracy part 2, water services and poverty. International Social Survey Programme (ISSP web page: www.issp.org/). The International Social Survey Programme (ISSP) is run by a group of research organisations, each of which undertakes to field annually an agreed module of questions on a chosen topic area. SASAS 2003 represents the formalisation of South Africa's inclusion in the ISSP, the intention being to include the module in one of the SASAS questionnaires in each round of interviewing.
Each module is chosen for repetition at intervals to allow comparisons both between countries (membership currently stands at 48) and over time. In 2006, the chosen subject was the role of government, and the module was carried in version two of the questionnaire (Qs. 174-229). This data can be accessed through the ISSP data portal (see link above).
Synthetic Administrative Data: Census 1991, 2023
datacatalogue.ukdataservice.ac.uk
Updated Feb 21, 2024
Cite
Shlomo, N, University of Manchester; Kim, M, University of Manchester (2024). Synthetic Administrative Data: Census 1991, 2023 [Dataset]. http://doi.org/10.5255/UKDA-SN-856310
Unique identifier
https://doi.org/10.5255/UKDA-SN-856310
Dataset updated
Feb 21, 2024
Authors
Shlomo, N, University of Manchester; Kim, M, University of Manchester
Area covered
United Kingdom
Description
We create a synthetic administrative dataset, to be used in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin), that mimics the properties of a real administrative dataset according to specifications by the ONS. Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1,033,664 individuals.
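The delete, move and duplicate steps described above can be sketched as follows. This is an illustrative Python outline only (the project's own package is written in R). The record schema, the `rates` mapping, the function name, and the choice to apply at most one operation per record are all assumptions; the actual proportions and procedure follow the ONS specifications, which are not given here.

```python
import random


def perturb(records, rates, geographies, seed=0):
    """Delete, move, or duplicate records at per-group rates.

    records: list of dicts with 'group' and 'geography' keys (hypothetical schema).
    rates: {group: (p_delete, p_move, p_duplicate)}, each triple summing to <= 1.
    geographies: list of at least two geography codes.
    """
    rng = random.Random(seed)
    out = []
    for rec in records:
        p_del, p_move, p_dup = rates[rec["group"]]
        u = rng.random()
        if u < p_del:
            continue  # record deleted (under-coverage)
        others = [g for g in geographies if g != rec["geography"]]
        if u < p_del + p_move:
            # record moved to a different geography
            out.append({**rec, "geography": rng.choice(others)})
        elif u < p_del + p_move + p_dup:
            # record kept and duplicated into a different geography
            out.append(rec)
            out.append({**rec, "geography": rng.choice(others)})
        else:
            out.append(rec)  # record left unchanged
    return out
```

Here a group would be one ethnic-group-by-gender cell, so each cell can be given its own deletion, movement and duplication rates.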
National Statistical Institutes (NSIs) are directing resources into advancing the use of administrative data in official statistics systems. This is a top priority for the UK Office for National Statistics (ONS) as they are undergoing transformations in their statistical systems to make more use of administrative data for future censuses and population statistics. Administrative data are defined as secondary data sources since they are produced by other agencies as a result of an event or a transaction relating to administrative procedures of organisations, public administrations and government agencies. Nevertheless, they have the potential to become important data sources for the production of official statistics by significantly reducing the cost and burden of response and improving the efficiency of such systems. Embedding administrative data in statistical systems is not without costs and it is vital to understand where potential errors may arise. The Total Administrative Data Error Framework sets out all possible sources of error when using administrative data as statistical data, depending on whether it is a single data source or integrated with other data sources such as survey data. For a single administrative data source, one of the main sources of error is coverage and representation of the target population of interest. This is particularly relevant when administrative data is delivered over time, such as tax data for maintaining the Business Register. For sub-project 1 of this research project, we develop quality indicators that allow the statistical agency to assess if the administrative data is representative of the target population and which sub-groups may be missing or over-covered. This is essential for producing unbiased estimates from administrative data.
Another priority at statistical agencies is to produce a statistical register for population characteristic estimates, such as employment statistics, from multiple sources of administrative and survey data. Using administrative data to build a spine, survey data can be integrated using record linkage and statistical matching approaches on a set of common matching variables. This will be the topic for sub-project 2, which will be split into several topics of research. The first topic is whether adding statistical predictions and correlation structures improves the linkage and data integration. The second topic is to research a mass imputation framework for imputing missing target variables in the statistical register where the missing data may be due to multiple underlying mechanisms. Therefore, the third topic will aim to improve the mass imputation framework to mitigate against possible measurement errors, for example by adding benchmarks and other constraints into the approaches. On completion of a statistical register, estimates for key target variables at local areas can easily be aggregated. However, it is essential to also measure the precision of these estimates through mean square errors and this will be the fourth topic of the sub-project. Finally, this new way of producing official statistics is compared to the more common method of incorporating administrative data through survey weights and model-based estimation approaches. In other words, we evaluate whether it is better 'to weight' or 'to impute' for population characteristic estimates - a key question under investigation by survey statisticians in the last decade.
Data from: Research on potential disruptive technology identification based...
search.dataone.org
datasetcatalog.nlm.nih.gov
+1more
Updated Jul 25, 2025
Cite
M. L. Ding (2025). Research on potential disruptive technology identification based on technology network [Dataset]. http://doi.org/10.5061/dryad.z612jm6j8
Unique identifier
https://doi.org/10.5061/dryad.z612jm6j8
Dataset updated
Jul 25, 2025
Dataset provided by
Dryad Digital Repository
Authors
M. L. Ding
Time period covered
Jan 1, 2023
Description
Disruptive technology has three evident and meaningful characteristics: a zeroing effect, in which its remarkable and unprecedented progress renders sustaining technology useless; a reshaping of the technological and economic landscape; and a leading role in the future mainstream of the technology system. All three have profound impacts. Identifying disruptive technology is widely regarded as a difficult task. The paper therefore aims to enhance the technical relevance of the identification results for potential disruptive technology and to improve the granularity and effectiveness of the identified topics. Following life cycle theory, the time span is divided into stages, and dynamic technology networks are then constructed and analyzed to identify potential disruptive technology. The LDA topic model is then used to clarify the topic content of the potential disruptive technologies. This paper takes large civil UAVs as an example to prove the feas...

Through analysis of the technology life cycle, division of the patents, construction of the technology network, identification of leaping nodes, and clustering of technical topics, we aim to identify potential disruptive technology.

Procedures:

Knowledge flow: becoming familiar with the technical background of large civil UAVs and completing the technical decomposition.

Invention patents: analyzing the technology life cycle with the loget lab to separate the invention patents into four parts; for each part, constructing the IPC technical network and identifying the leapfrogging and diffusible nodes.

Technical topics: using the LDA model to cluster and explain the broad and varied content of the inventions.


Testing: dividing the inventions of the embryonic stage into two groups and comparing them with the Mann-Whitney test. The result shows large differences in the patent value, sustaining influence, and c...

This README file was generated on 2023-11-25 by Mingli Ding.

GENERAL INFORMATION

Title of Dataset: technical network in the field of large civilian UAVs

Author Information

Investigators Contact Information

Name: Mingli Ding; Wangke Yu; Ran Li; Zhenzhen Wang; Jianing Li

Institution: Jingdezhen Ceramic University

Address: Jingdezhen, Jiangxi, China

Email:

Date of data collection: 2005-2018

DATA & FILE OVERVIEW

File List:

A)patent (2005-2008).csv

B)patents (2009-2012).csv

C)patents (2013-2015).csv

D)patents (2016-2018).csv

E)technical network (2005-2008).csv

F)technical network (2009-2012).csv

G)technical networks (2013-2015).csv

H)technical network (2016-2018).csv

DATA-SPECIFIC INFORMATION FOR: patent (2005-2008).csv

Number of variables: 2

Number of cases/rows: 234

Variable List:

source: the source of the edges in the technical network

target: the target of the edges in the technical network
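
Because each row of the file is one edge described by a source and a target, a node's connectivity can be read straight off the edge list. The sketch below is illustrative only: the sample rows, the `source,target` header line, and the `degree_counts` helper are assumptions made for the example and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the actual node labels in the file are not shown
# in this listing, and the header line is assumed.
SAMPLE_CSV = """source,target
B64C,B64D
B64C,G05D
B64D,G05D
B64C,H04N
"""


def degree_counts(csv_text):
    """Count how many edges touch each node in a source/target edge list."""
    degrees = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        degrees[row["source"]] += 1
        degrees[row["target"]] += 1
    return degrees


degrees = degree_counts(SAMPLE_CSV)
for node, degree in degrees.most_common():
    print(node, degree)
```

To run this against a real file, the same loop could be fed `open(path, newline="")` instead of the in-memory string, provided the file's header matches the variable names listed above.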

4. Specialized fo...
Data from: Variable description.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Dec 7, 2023
Cite
Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan (2023). Variable description. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001031498
Dataset updated
Dec 7, 2023
Authors
Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan
Description
Studies in the past have examined asthma prevalence and the associated risk factors in the United States using data from national surveys. However, the findings of these studies may not be relevant to specific states because environmental and socioeconomic factors vary across regions. The 2019 Behavioral Risk Factor Surveillance System (BRFSS) showed that Michigan had higher asthma prevalence rates than the national average. In this regard, we employ various modern machine learning techniques to predict asthma and identify risk factors associated with asthma among Michigan adults using the 2019 BRFSS data. After data cleaning, a sample of 10,337 individuals was selected for analysis, out of which 1,118 individuals (10.8%) reported having asthma during the survey period. Typical machine learning techniques often perform poorly on imbalanced data. To address this challenge, we employed two synthetic data generation techniques, namely Random Over-Sampling Examples (ROSE) and the Synthetic Minority Over-Sampling Technique (SMOTE), and compared their performance. The overall performance of the machine learning algorithms improved with both methods, with ROSE performing better than SMOTE. Among the ROSE-adjusted models, we found that logistic regression, partial least squares, gradient boosting, LASSO, and elastic net had comparable performance, with sensitivity at around 50% and area under the curve (AUC) at around 63%. Because it is easy to interpret, logistic regression is chosen for further exploration of risk factors. The presence of chronic obstructive pulmonary disease, lower income, female sex, inability to see a doctor because of cost, having had a flu shot/spray in the past 12 months, the 18–24 age group, the Black, non-Hispanic group, and the presence of diabetes are identified as asthma risk factors.
This study demonstrates the potential of machine learning coupled with imbalanced data modeling approaches for predicting asthma from a large survey dataset. We conclude that the findings could guide early screening of at-risk asthma patients and the design of appropriate interventions to improve care practices.
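
To make the imbalance concrete, the sketch below first recomputes the minority share from the counts quoted in the description, then illustrates the core idea behind SMOTE: a synthetic minority record is placed on the line segment between a minority point and one of its nearest minority neighbours. This is a simplified illustration and not the authors' pipeline; the `smote_like` helper, the parameter `k`, and the toy feature values are all assumptions made for the example.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(a + t * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Class balance reported in the description.
n_total, n_asthma = 10337, 1118
prevalence = n_asthma / n_total
print(f"minority share: {prevalence:.1%}")

# Toy two-feature minority sample (hypothetical values, not BRFSS data).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like(minority, n_new=6)
print(len(new_points))
```

ROSE differs in that it draws synthetic records from a smoothed bootstrap around existing observations instead of interpolating between neighbours, so the sketch should not be read as a stand-in for it.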
Data from: National Science Foundation Surveys of Public Attitudes Toward...
icpsr.umich.edu
ascii, sas, spss
Updated Jan 18, 2006
+ more versions
Cite
Miller, Jon D.; Kimmel, Linda (2006). National Science Foundation Surveys of Public Attitudes Toward and Understanding of Science and Technology, 1979-2001: [United States] [Dataset]. http://doi.org/10.3886/ICPSR04029.v1
Explore at:
Available download formats: spss, ascii, sas
Unique identifier
https://doi.org/10.3886/ICPSR04029.v1
Dataset updated
Jan 18, 2006
Dataset provided by
Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
Authors
Miller, Jon D.; Kimmel, Linda
License
https://www.icpsr.umich.edu/web/ICPSR/studies/4029/terms
Time period covered
1979 - 2001
Area covered
United States
Description
The National Science Foundation (NSF) Surveys of Public Attitudes monitored the general public's attitudes toward and interest in science and technology. In addition, the survey assessed levels of literacy and understanding of scientific and environmental concepts and constructs, how scientific knowledge and information were acquired, attentiveness to public policy issues, and computer access and usage. Since 1979, the survey was administered at regular intervals (occurring every two or three years), producing 11 cross-sectional surveys through 2001. Data for Part 1 (Survey of Public Attitudes Multiple Wave Data) were comprised of the survey questionnaire items asked most often throughout the 22-year survey series and account for approximately 70 percent of the original questions asked. Data for Part 2, General Social Survey Subsample Data, combine the 1983-1999 Survey of Public Attitudes data with a subsample from the 2002 General Social Survey (GSS) (GENERAL SOCIAL SURVEYS, 1972-2002: [CUMULATIVE FILE] [ICPSR 3728]) and focus solely on levels of education and computer access and usage. Variables for Part 1 include the respondents' interest in new scientific or medical discoveries and inventions, space exploration, military and defense policies, whether they voted in a recent election, if they had ever contacted an elected or public official about topics regarding science, energy, defense, civil rights, foreign policy, or general economics, and how they felt about government spending on scientific research. Respondents were asked how they received information concerning science or news (e.g., via newspapers, magazines, or television), what types of television programming they watched, and what kind of magazines they read. Respondents were asked a series of questions to assess their understanding of scientific concepts like DNA, probability, and experimental methods. 
Respondents were also asked if they agreed with statements concerning science and technology and how they affect everyday living. Respondents were further asked a series of true and false questions regarding science-based statements (e.g., the center of the Earth is hot, all radioactivity is manmade, electrons are smaller than atoms, the Earth moves around the sun, humans and dinosaurs co-existed, and human beings developed from earlier species of animals). Variables for Part 2 include highest level of math attained in high school, whether the respondent had a postsecondary degree, field of highest degree, number of science-based college courses taken, major in college, household ownership of a computer, access to the World Wide Web, number of hours spent on a computer at home or at work, and topics searched for via the Internet. Demographic variables for Parts 1 and 2 include gender, race, age, marital status, number of people in household, level of education, and occupation.
Social Attitudes Survey 2005 - South Africa
microdata.worldbank.org
catalog.ihsn.org
+1more
Updated May 7, 2014
+ more versions
Cite
Human Sciences Research Council (HSRC) (2014). Social Attitudes Survey 2005 - South Africa [Dataset]. https://microdata.worldbank.org/index.php/catalog/1594
Explore at:
Dataset updated
May 7, 2014
Dataset provided by
Human Sciences Research Council (https://hsrc.ac.za/)
Authors
Human Sciences Research Council (HSRC)
Time period covered
2005
Area covered
South Africa
Description
Abstract

The primary objective of SASAS is to design, develop and implement a conceptually and methodologically robust study of changing social attitudes and values in South Africa to be able to carefully and consistently monitor and explain changes in attitudes amongst various socio-demographic groupings. The SASAS explores a wide range of value changes, including the distribution and shape of racial attitudes and aspirations, attitudes towards democratic and constitutional issues, and the redistribution of resources and power. Moreover, there is also an explicit interest in mapping changing attitudes towards some of the moral issues that confront and are fiercely debated in South Africa, such as gender issues, AIDS, crime and punishment, governance, and service delivery. The SASAS is intended to provide a unique long-term account of the social fabric of modern South Africa, and of how its changing political and institutional structures interact over time with changing social attitudes and values.

Geographic coverage

National coverage

Analysis unit

The units of analysis in the study are households and individuals

Universe

The population under investigation includes adults aged 16 and older in private households in South Africa

Kind of data

Sample survey data [ssd]

Sampling procedure

Sampling design: The South African Social Attitudes Survey has been designed to yield a representative sample of adults aged 16 and older. The sampling frame for the survey is the Human Sciences Research Council’s (HSRC) Master Sample, which was designed in 2002 and consists of 1 000 primary sampling units (PSUs). The 2001 population census enumerator areas (EAs) were used as PSUs. These PSUs were drawn, with probability proportional to size, from a pre-census 2001 list of EAs provided by Statistics South Africa.

The Master Sample excludes special institutions (such as hospitals, military camps, old age homes, school and university hostels), recreational areas, industrial areas and vacant EAs. It therefore focuses on dwelling units or visiting points as secondary sampling units, which have been defined as ‘separate (non-vacant) residential stands, addresses, structures, flats, homesteads, etc.’.

As the basis of the 2005 SASAS round of interviewing, a sub-sample of 500 PSUs was drawn from the HSRC’s Master Sample. Three explicit stratification variables were used, namely province, geographic type and majority population group.

Within each stratum, the allocated number of PSUs was drawn using proportional to size probability sampling. In each of these drawn PSUs, two clusters of 7 dwelling units each were drawn. These 14 dwelling units in each drawn PSU were systematically grouped into two subsamples of seven, to give the two SASAS samples.

Number of units: Questionnaire 1: 2 497 cases realised from 3 500 addresses; Questionnaire 2: 2 483 cases realised from 3 500 addresses; combined: 4 980 cases
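
The address counts above follow from the design figures, and the arithmetic can be checked with a short sketch. The constants come from the sampling description; the alternating rule in `split_alternating` is an assumption made for illustration, since the text says only that the 14 dwelling units were "systematically grouped" into two subsamples of seven.

```python
N_PSUS = 500            # PSUs drawn from the HSRC Master Sample
CLUSTERS_PER_PSU = 2    # two clusters of dwelling units per drawn PSU
UNITS_PER_CLUSTER = 7   # dwelling units in each cluster

# One cluster of seven per PSU feeds each questionnaire version.
addresses_per_questionnaire = N_PSUS * UNITS_PER_CLUSTER
total_addresses = N_PSUS * CLUSTERS_PER_PSU * UNITS_PER_CLUSTER

# Realised cases as reported under "Number of units".
realised = {"questionnaire 1": 2497, "questionnaire 2": 2483}
rates = {
    name: cases / addresses_per_questionnaire
    for name, cases in realised.items()
}
combined = sum(realised.values())


def split_alternating(units):
    """Assumed grouping rule: alternate units between the two subsamples."""
    return units[0::2], units[1::2]


sample_1, sample_2 = split_alternating(list(range(1, 15)))

print(addresses_per_questionnaire, total_addresses, combined)
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} realised")
```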

Mode of data collection

Face-to-face [f2f]

Research instrument

To accommodate the wide variety of topics that was included in the 2005 survey, two questionnaires were administered simultaneously. Apart from the standard set of demographic and background variables, each version of the questionnaire contained a harmonised core module that remains constant from round to round, with the aim of monitoring change and continuity in a variety of socio-economic and socio-political variables. In addition, a number of themes are accommodated on a rotational basis. This rotating element of the survey consists of two or more topic-specific modules in each round of interviewing and is directed at measuring a range of policy and academic concerns and issues that require more detailed examination at a specific point in time than the multi-topic core module would permit.

Questions for the core module were asked of both samples (3 500 respondents each, 7 000 in total), of which 5 734 were realised.

The ISSP module: The International Social Survey Programme (ISSP) is run by a group of research organisations, each of which undertakes to field annually an agreed module of questions on a chosen topic area. SASAS 2003 represents the formalisation of South Africa's inclusion in the ISSP, the intention being to include the module in one of the SASAS questionnaires in each round of interviewing. Each module is chosen for repetition at intervals to allow comparisons both between countries (membership currently stands at 45) and over time. In 2005, the chosen subject was work orientation, and the module was carried in version 2 of the questionnaire (Qs.98-169).

The standard questionnaires dealt with democracy, identity, public services, social values, crime, voting, demographics, families and family authority. The rotating modules in the 2005 survey covered:

Questionnaire 1: Poverty and social exclusion, family life

Questionnaire 2: ISSP module (work orientation), soccer World Cup, democracy part 2

