Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Suppose we observe a random vector X from some distribution in a known family with unknown parameters. We ask the following question: when is it possible to split X into two pieces f(X) and g(X) such that neither part is sufficient to reconstruct X by itself, but both together can recover X fully, and their joint distribution is tractable? One common solution to this problem when multiple samples of X are observed is data splitting, but Rasines and Young offer an alternative approach that uses additive Gaussian noise; this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this article, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on several prototypical applications, such as post-selection inference for trend filtering and other regression problems, and effect size estimation after interactive multiple testing. Supplementary materials for this article are available online.
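The abstract does not spell out a formula, so the following is only a minimal sketch of one Gaussian construction that fits the description of an additive-noise split. It assumes X ~ N(mu, sigma^2) with sigma known, draws independent noise Z ~ N(0, sigma^2), and sets f(X) = X + tau*Z and g(X) = X - Z/tau. Under these assumptions the two pieces are uncorrelated (hence independent, being jointly Gaussian), and X is recovered as (f + tau^2 * g) / (1 + tau^2). The function names and the tuning parameter tau are illustrative choices, not taken from the article.

```python
import random


def gaussian_fission(x, sigma, tau, rng):
    """Split x into two noisy pieces (assumes x ~ N(mu, sigma^2), sigma known)."""
    z = rng.gauss(0.0, sigma)   # independent noise with the same variance as x
    f_x = x + tau * z           # piece one could use for selection
    g_x = x - z / tau           # piece one could use for inference
    return f_x, g_x


def recombine(f_x, g_x, tau):
    """Recover x exactly from both pieces: f + tau^2 * g = (1 + tau^2) * x."""
    return (f_x + tau ** 2 * g_x) / (1 + tau ** 2)


rng = random.Random(0)
f_x, g_x = gaussian_fission(2.5, sigma=1.0, tau=0.7, rng=rng)
x_back = recombine(f_x, g_x, tau=0.7)
```

In this sketch a larger tau puts more noise into f(X) and less into g(X), which shifts information from the first piece to the second.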
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Dataset contains counts and measures for families and extended families from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.
The variables included in this dataset are for families and extended families in households in occupied private dwellings:
Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.
Footnotes
Geographical boundaries
Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.
Caution using time series
Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).
About the 2023 Census dataset
For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.
Data quality
The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.
Concept descriptions and quality ratings
Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.
Using data for good
Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga".
Confidentiality
The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
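The exact Stats NZ procedure is documented in the confidentiality rules referenced above, not here, so the following is only a rough sketch of what rounding to base 3 can look like. It assumes the conventional unbiased scheme (a count already divisible by 3 is unchanged; any other count moves to the nearer multiple of 3 with probability 2/3 and to the farther one with probability 1/3), and it assumes "fixed" means the same cell always rounds the same way, which is imitated by seeding from a hypothetical `cell_key`. The function name and the hashing approach are illustrative assumptions.

```python
import hashlib
import random


def frr3(count, cell_key):
    """Round count to a multiple of 3, reproducibly for a given cell_key (sketch)."""
    remainder = count % 3
    if remainder == 0:
        return count  # multiples of 3 are left unchanged
    # Seed from the cell key and count so repeated calls give the same answer.
    digest = hashlib.sha256(f"{cell_key}:{count}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    lower = count - remainder
    upper = lower + 3
    # Remainder 1 is closer to the lower multiple; remainder 2 to the upper one.
    nearer, farther = (lower, upper) if remainder == 1 else (upper, lower)
    return nearer if rng.random() < 2 / 3 else farther


rounded = [frr3(c, "example-cell") for c in range(0, 12)]
```

Because each cell is rounded on its own, rounded components need not add up to a rounded total, which is consistent with the note that individual figures may not always sum to stated totals.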
Measures
Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.
Percentages
To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.
Symbol
-997 Not available
-999 Confidential
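The percentage rule and the placeholder symbols above can be combined in a small helper. This is a minimal sketch, assuming counts arrive as plain integers and that -997 and -999 should be treated as missing rather than as numbers; the function and constant names are illustrative.

```python
NOT_AVAILABLE = -997   # symbol for "Not available"
CONFIDENTIAL = -999    # symbol for "Confidential"
PLACEHOLDERS = {NOT_AVAILABLE, CONFIDENTIAL}


def percentage(category_count, total_stated):
    """Return category_count as a percentage of 'Total stated', or None if unusable."""
    if category_count in PLACEHOLDERS or total_stated in PLACEHOLDERS:
        return None  # never do arithmetic on placeholder codes
    if total_stated == 0:
        return None  # avoid division by zero for empty areas
    return 100.0 * category_count / total_stated


share = percentage(30, 120)
suppressed = percentage(CONFIDENTIAL, 120)
```

Returning `None` for suppressed cells keeps placeholder codes from leaking into averages or charts as large negative values.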
Inconsistencies in definitions
Please note that there may be differences in definitions between census classifications and those used for other data collections.
In 2007-2008 a multi-topic household survey, the Timor Leste Living Standards Survey (LSS-2), was conducted in East Timor with the main objectives of developing a system of poverty monitoring, supporting poverty reduction, and monitoring human development indicators and progress toward the Millennium Development Goals. The LSS-3 extension survey was designed to re-visit one third of the households interviewed under the LSS-2 to explore different facets of household welfare and behaviour in the country, while also being able to make use of information collected in the LSS-2 survey for analytic purposes. The four new topics investigated in the extension survey are:
National coverage
Households
Sample survey data [ssd]
SAMPLE DESIGN FOR THE 2008 EXTENSION SURVEY
The sample for the LSS-3 extension survey was a sub-sample of the original LSS-2 sample. The LSS-2 field work was divided into 52 "weeks", with each week being a random subset of the total sample. The sub-sample was chosen by randomly selecting 19 weeks from the original field work schedule. Each week contained seven Primary Sampling Units (PSUs), for a total of 133 PSUs. In each PSU the teams were to interview 12 of the original 15 households, with the remaining three to serve as replacements. The total nominal sample size was thus 1596.
Additional interviews: Following the collection and initial analysis of the data, it was determined that data from one district, Manatuto, and partially from another district, Oecussi, were of insufficient quality in certain modules. Therefore, it was decided to repeat the survey in another 25 PSUs of these two districts - six in Manatuto, and 19 in Oecussi. The additional PSUs chosen were randomly selected within the two districts from the remaining non-panel PSUs in the original LSS-2 sample.
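The sample-size arithmetic above can be checked directly. The sketch below restates the figures from the text and illustrates drawing 19 of the 52 field-work weeks at random; the seed and the function name are illustrative assumptions, and the draw does not reproduce the weeks actually selected.

```python
import random

TOTAL_WEEKS = 52          # LSS-2 field work was divided into 52 "weeks"
SELECTED_WEEKS = 19       # weeks drawn for the extension survey
PSUS_PER_WEEK = 7         # Primary Sampling Units per week
HOUSEHOLDS_PER_PSU = 15   # households originally interviewed per PSU
INTERVIEWS_PER_PSU = 12   # households to be re-interviewed per PSU


def draw_weeks(seed):
    """Randomly select the extension-survey weeks (illustrative draw only)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, TOTAL_WEEKS + 1), SELECTED_WEEKS))


total_psus = SELECTED_WEEKS * PSUS_PER_WEEK                      # 19 * 7 = 133
nominal_sample = total_psus * INTERVIEWS_PER_PSU                 # 133 * 12 = 1596
replacements_per_psu = HOUSEHOLDS_PER_PSU - INTERVIEWS_PER_PSU   # 15 - 12 = 3
additional_psus = 6 + 19                                         # Manatuto + Oecussi = 25
weeks = draw_weeks(seed=2008)
```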
Face-to-face [f2f]
DATA CLEANING
The LSS-3 had a significant number of responses in which the response is "other". In general, if the response clearly fit into a pre-coded response category, it was recoded into that category during the cleaning and compilation process. Some responses where additional information was provided were not recoded even though they clearly fit into pre-coded categories. For example, "agriculture project" would be recoded into the "agriculture" category, while "community garden" would not. Data users can either use the additional information, or re-code into categories as they see fit.
Potential Data Quality Issues in 2008 Extension survey
Agriculture: As with the individual roster of the previous section, the plots listed in the previous survey are listed on the pre-printed cover page and all changes noted. The agricultural section, like the other sections, suffers from problems with open-ended questions. This is particularly the case for the question asking what community restrictions are placed on the clearing of forest land (section 2d). The translation from the original question was vague (using the Tetun word for "boundary" for "restriction"), and therefore many of the responses relate to physical boundaries on the land, such as stone walls and tree lines. Additionally, the translation of all answers from Tetun into English is imperfect, and those wishing to use this information for analytical purposes are advised to also refer to the original Tetun. Analysts should be careful in using the data from the open-ended questions because of translation problems. Also, it was noted during the training and field work that many interviewers had significant difficulties understanding definitions in some of the land management and investment questions. In general, however, all agricultural data may be used for analysis, with sampling weights w3.
Finance: It should be noted that the quality of the data for the finance experiment (comparing the knowledge of the household head to that of other household members) was not sufficient for the experiment to be deemed a success. Subsequent spot-checking revealed that in many cases, interviewers asked the household head about the financial activities of various household members instead of asking them directly. Therefore, this data should only be used to measure the access to finance at the household level. The finance sections were not repeated during the additional interviews in the replacement PSUs. Sampling weights w1 should be used when doing any analysis with this data.
Shocks and Vulnerability: It was determined following the initial round of data collection that the shocks and vulnerability module had some issues with uneven interview quality. Two reasons were listed as potential causes of the data quality issues: (1) fundamental inability to adequately translate both the word and concept of a "shock" into the Timorese context, and (2) incomplete or questionable responses to the health shock questions in particular. Analysis for health shocks should drop the "questionable" households and use the "re-interview" households, with sampling weights w2.
Justice for the Poor: Similar to the shocks and vulnerability module, the justice module included a long series of follow-up questions if the household indicated having experienced a dispute during the recall period. Again, the number of disputes experienced by the household seemed extremely low compared to expectations. This was particularly a problem in the Manatuto district, in which no disputes were recorded during the first set of TLSLS2-X interviews. Analysis for the disputes section of the justice module should drop the "questionable" households and use the "re-interview" households, with sampling weights w2. The justice module also has a number of instances in which the specifications for "other" were not recorded. Every effort was made to ensure this data was as complete as possible, but gaps do remain. Also, data users should use caution when using the imputed rank variable in section 5D. The rank in terms of importance was not explicitly captured in the data entry software, and the rankings therefore had to be imputed from the order they were listed in the original data entry. Inconsistencies may exist in this variable.
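The weighting guidance in the module notes above can be summarised in a small lookup. This is a minimal sketch, assuming each household is a dictionary with weight fields named w1, w2 and w3 (the weight names come from the notes) and a hypothetical boolean field `questionable`; the module labels, the field names and the function name are illustrative and do not reflect the actual variable names in the data files.

```python
# Weight to use for each module, as stated in the notes above.
MODULE_WEIGHT = {
    "agriculture": "w3",
    "finance": "w1",
    "health_shocks": "w2",
    "justice_disputes": "w2",
}

# Modules for which "questionable" households should be dropped.
DROP_QUESTIONABLE = {"health_shocks", "justice_disputes"}


def weighted_mean(rows, value_key, module):
    """Weighted mean of value_key using the weight recommended for the module."""
    weight_key = MODULE_WEIGHT[module]
    kept = [
        row for row in rows
        if not (module in DROP_QUESTIONABLE and row.get("questionable", False))
    ]
    total_weight = sum(row[weight_key] for row in kept)
    if total_weight == 0:
        raise ValueError("no households with positive weight remain")
    return sum(row[value_key] * row[weight_key] for row in kept) / total_weight


households = [
    {"y": 1.0, "w1": 1.0, "w2": 2.0, "w3": 1.0, "questionable": False},
    {"y": 0.0, "w1": 1.0, "w2": 2.0, "w3": 3.0, "questionable": True},
]
shock_rate = weighted_mean(households, "y", "health_shocks")
plot_rate = weighted_mean(households, "y", "agriculture")
```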
We provide instructions, codes and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for any researchers or practitioners to apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their datasets.
First, we provide an R code to replicate the illustrative simulation study: see file 1.
Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R.
Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note that, due to the dataset terms of use by Yelp and the restriction of data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6).
Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5.
Please see more details in the description text and comments of each file.
[A guide on how to use the code to reproduce each study in the paper]
1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is R source code to replicate the illustrative simulation study. Please run from the beginning to the end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes.
3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing DV and IVs matrix for customer-level segmentation study.
3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours.
4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing DV and IVs matrix for restaurant-level segmentation study.
4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating restaurant-level segmentation study with Yelp. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours.
[Guidelines for running Benchmark models in Table 6]
Unsupervised topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then, compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors. Then, conduct prediction with regression.
Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics in the package (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).
Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.
Aggregate regression: 'lm' default function in R.
Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of the dependent variable for each segment.
Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in the package of Kim, Fong and DeSarbo (2012). Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of the dependent variable for each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home
5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the Professor ratings reviews study. Computing time is approximately 10 hours.
[A list of the versions of R, packages, and computer...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top 10 words of two topics with highest absolute values of regression coefficients and the topic coherence measured in NPMI on the training set when K = 15.
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Dataset contains counts and measures for individuals from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.
The variables included in this dataset are for the census usually resident population count (unless otherwise stated). All data is for level 1 of the classification (unless otherwise stated).
The variables for part 1 of the dataset are:
Download lookup file for part 1 from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.
Footnotes
Te Whata
Under the Mana Ōrite Relationship Agreement, Te Kāhui Raraunga (TKR) will be publishing Māori descent and iwi affiliation data from the 2023 Census in partnership with Stats NZ. This will be available on Te Whata, a TKR platform.
Geographical boundaries
Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.
Subnational census usually resident population
The census usually resident population count of an area (subnational count) is a count of all people who usually live in that area and were present in New Zealand on census night. It excludes visitors from overseas, visitors from elsewhere in New Zealand, and residents temporarily overseas on census night. For example, a person who usually lives in Christchurch city and is visiting Wellington city on census night will be included in the census usually resident population count of Christchurch city.
Population counts
Stats NZ publishes a number of different population counts, each using a different definition and methodology. Population statistics – user guide has more information about different counts.
Caution using time series
Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).
Study participation time series
In the 2013 Census study participation was only collected for the census usually resident population count aged 15 years and over.
About the 2023 Census dataset
For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.
Data quality
The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.
Concept descriptions and quality ratings
Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.
Disability indicator
This data should not be used as an official measure of disability prevalence. Disability prevalence estimates are only available from the 2023 Household Disability Survey. Household Disability Survey 2023: Final content has more information about the survey.
Activity limitations are measured using the Washington Group Short Set (WGSS). The WGSS asks about six basic activities that a person might have difficulty with: seeing, hearing, walking or climbing stairs, remembering or concentrating, washing all over or dressing, and communicating. A person was classified as disabled in the 2023 Census if there was at least one of these activities that they had a lot of difficulty with or could not do at all.
Using data for good
Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga".
Confidentiality
The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
Measures
Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures calculations. Averages and medians based on less than six units (e.g. individuals, dwellings, households, families, or extended families) are suppressed. This suppression threshold changes for other quantiles. Where the cells have been suppressed, a placeholder value has been used.
Percentages
To calculate percentages, divide the figure for the category of interest by the figure for 'Total stated' where this applies.
Symbol
-997 Not available
-999 Confidential
Inconsistencies in definitions
Please note that there may be differences in definitions between census classifications and those used for other data collections.
https://datafinder.stats.govt.nz/license/attribution-4-0-international/
Dataset contains counts and measures for individuals from the 2013, 2018, and 2023 Censuses. Data is available by statistical area 2.
The variables included in this dataset are for the census usually resident population count (unless otherwise stated). All data is for level 1 of the classification.
The variables for part 2 of the dataset are:
Download lookup file from Stats NZ ArcGIS Online or embedded attachment in Stats NZ geographic data service. Download data table (excluding the geometry column for CSV files) using the instructions in the Koordinates help guide.
Footnotes
Te Whata
Under the Mana Ōrite Relationship Agreement, Te Kāhui Raraunga (TKR) will be publishing Māori descent and iwi affiliation data from the 2023 Census in partnership with Stats NZ. This will be available on Te Whata, a TKR platform.
Geographical boundaries
Statistical standard for geographic areas 2023 (updated December 2023) has information about geographic boundaries as of 1 January 2023. Address data from 2013 and 2018 Censuses was updated to be consistent with the 2023 areas. Due to the changes in area boundaries and coding methodologies, 2013 and 2018 counts published in 2023 may be slightly different to those published in 2013 or 2018.
Subnational census usually resident population
The census usually resident population count of an area (subnational count) is a count of all people who usually live in that area and were present in New Zealand on census night. It excludes visitors from overseas, visitors from elsewhere in New Zealand, and residents temporarily overseas on census night. For example, a person who usually lives in Christchurch city and is visiting Wellington city on census night will be included in the census usually resident population count of Christchurch city.
Population counts
Stats NZ publishes a number of different population counts, each using a different definition and methodology. Population statistics – user guide has more information about different counts.
Caution using time series
Time series data should be interpreted with care due to changes in census methodology and differences in response rates between censuses. The 2023 and 2018 Censuses used a combined census methodology (using census responses and administrative data), while the 2013 Census used a full-field enumeration methodology (with no use of administrative data).
Study participation time series
In the 2013 Census study participation was only collected for the census usually resident population count aged 15 years and over.
About the 2023 Census dataset
For information on the 2023 dataset see Using a combined census model for the 2023 Census. We combined data from the census forms with administrative data to create the 2023 Census dataset, which meets Stats NZ's quality criteria for population structure information. We added real data about real people to the dataset where we were confident the people who hadn’t completed a census form (which is known as admin enumeration) will be counted. We also used data from the 2018 and 2013 Censuses, administrative data sources, and statistical imputation methods to fill in some missing characteristics of people and dwellings.
Data quality
The quality of data in the 2023 Census is assessed using the quality rating scale and the quality assurance framework to determine whether data is fit for purpose and suitable for release. Data quality assurance in the 2023 Census has more information.
Concept descriptions and quality ratings
Data quality ratings for 2023 Census variables has additional details about variables found within totals by topic, for example, definitions and data quality.
Disability indicator
This data should not be used as an official measure of disability prevalence. Disability prevalence estimates are only available from the 2023 Household Disability Survey. Household Disability Survey 2023: Final content has more information about the survey.
Activity limitations are measured using the Washington Group Short Set (WGSS). The WGSS asks about six basic activities that a person might have difficulty with: seeing, hearing, walking or climbing stairs, remembering or concentrating, washing all over or dressing, and communicating. A person was classified as disabled in the 2023 Census if there was at least one of these activities that they had a lot of difficulty with or could not do at all.
Using data for good
Stats NZ expects that, when working with census data, it is done so with a positive purpose, as outlined in the Māori Data Governance Model (Data Iwi Leaders Group, 2023). This model states that "data should support transformative outcomes and should uplift and strengthen our relationships with each other and with our environments. The avoidance of harm is the minimum expectation for data use. Māori data should also contribute to iwi and hapū tino rangatiratanga".
Confidentiality
The 2023 Census confidentiality rules have been applied to 2013, 2018, and 2023 data. These rules protect the confidentiality of individuals, families, households, dwellings, and undertakings in 2023 Census data. Counts are calculated using fixed random rounding to base 3 (FRR3) and suppression of ‘sensitive’ counts less than six, where tables report multiple geographic variables and/or small populations. Individual figures may not always sum to stated totals. Applying confidentiality rules to 2023 Census data and summary of changes since 2018 and 2013 Censuses has more information about 2023 Census confidentiality rules.
Measures
Measures like averages, medians, and other quantiles are calculated from unrounded counts, with input noise added to or subtracted from each contributing value during measures
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Top 10 words of two topics with highest absolute values of regression coefficients.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Baseline multiple linear regression model with end fitness as the response variable, showing the calculated variance inflation factors (VIFs).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The European Union Statistics on Income and Living Conditions (EU-SILC) collects timely and comparable multidimensional microdata on income, poverty, social exclusion and living conditions.
The EU-SILC collection is a key instrument for providing information required by the European Semester ([1]) and the European Pillar of Social Rights, and the main source of data for microsimulation purposes and flash estimates of income distribution and poverty rates.
AROPE remains crucial to monitor European social policies, especially to monitor the EU 2030 target on poverty and social exclusion. For more information, please consult EU social indicators.
The EU-SILC instrument provides two types of data:
EU-SILC collects:
The variables collected are grouped by topic and detailed topic and transmitted to Eurostat in four main files (D-File, H-File, R-File and P-file).
The domain ‘Income and Living Conditions’ covers the following topics: persons at risk of poverty or social exclusion, income inequality, income distribution and monetary poverty, living conditions, material deprivation, and EU-SILC ad-hoc modules, which are structured into collections of indicators on specific topics.
In 2023, in addition to the annual data, EU-SILC collected the three-yearly module on labour market and housing, the six-yearly module on intergenerational transmission of advantages and disadvantages, housing difficulties, and the ad hoc subject on household energy efficiency.
From 2021 onwards, the EU quality reports use the structure of the Single Integrated Metadata Structure (SIMS).
([1]) The European Semester is the European Union’s framework for the coordination and surveillance of economic and social policies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Corporate financialization is a growing concern in China, and its impact on the main business of real enterprises is a crucial topic. This paper uses data from all A-share non-financial listed companies in China between 2013 and 2022 to establish a dynamic panel threshold model and test the effect of corporate financialization on enterprise performance. The empirical results indicate a threshold effect between the two variables: corporate financialization has both positive and negative effects on main business performance, with a threshold of 5.82%. Additionally, significantly heterogeneous results are found across ownership type, asset maturity, industry and regional distribution.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MSE and sample standard deviation on test set of movie rating score prediction when K = 15.
https://www.icpsr.umich.edu/web/ICPSR/studies/39057/terms
The Michigan Public Policy Survey (MPPS) is a program of state-wide surveys of local government leaders in Michigan. The MPPS is designed to fill an important information gap in the policymaking process. While there are ongoing surveys of the business community and of the citizens of Michigan, before the MPPS there were no ongoing surveys of local government officials that were representative of all general purpose local governments in the state. Therefore, while we knew the policy priorities and views of the state's businesses and citizens, we knew very little about the views of the local officials who are so important to the economies and community life throughout Michigan. The MPPS was launched in 2009 by the Center for Local, State, and Urban Policy (CLOSUP) at the University of Michigan and is conducted in partnership with the Michigan Association of Counties, Michigan Municipal League, and Michigan Townships Association. The associations provide CLOSUP with contact information for the survey's respondents, and consult on survey topics. CLOSUP makes all decisions on survey design, data analysis, and reporting, and receives no funding support from the associations. The surveys investigate local officials' opinions and perspectives on a variety of important public policy issues and solicit factual information about their localities relevant to policymaking. Over time, the program has covered issues such as fiscal, budgetary and operational policy, fiscal health, public sector compensation, workforce development, local-state governmental relations, intergovernmental collaboration, economic development strategies and initiatives such as placemaking and economic gardening, the role of local government in environmental sustainability, energy topics such as hydraulic fracturing ("fracking") and wind power, trust in government, views on state policymaker performance, opinions on the impacts of the Federal Stimulus Program (ARRA), and more. 
The program will investigate many other issues relevant to local and state policy in the future. A searchable database of every question the MPPS has asked is available on CLOSUP's website. Results of MPPS surveys are currently available as reports, and via online data tables. The MPPS datasets are being released in two forms: public-use datasets and restricted-use datasets. Unlike the public-use datasets, the restricted-use datasets represent full MPPS survey waves, and include all of the survey questions from a wave. Restricted-use datasets also allow for multiple waves to be linked together for longitudinal analysis. The MPPS staff do still modify these restricted-use datasets to remove jurisdiction and respondent identifiers and to recode other variables in order to protect confidentiality. However, it is theoretically possible that a researcher might be able, in some rare cases, to use enough variables from a full dataset to identify a unique jurisdiction, so access to these datasets is restricted and approved on a case-by-case basis. CLOSUP encourages researchers interested in the MPPS to review the codebooks included in this data collection to see the full list of variables including those not found in the public-use datasets, and to explore the MPPS data using the public-use-datasets. The codebooks for these restricted use datasets are available for download on CLOSUP's website.
Description: The harmonised core module data are available in the combined dataset. The questions contained in the core modules of the two SASAS questionnaires for 2006 (demographics and core thematic issues) were asked of 7000 respondents, while the remaining rotating modules were asked of a half sample of approximately 3500 respondents each. The combined data set contains 5843 records and 157 variables. Topics included in the questionnaires are: democracy, identity, public services, moral issues, crime, voting, demographics and other classificatory variables. This version of the combined dataset should be used where analysis is to be performed at household level.
Abstract: The primary objective of the South African Social Attitudes Survey (SASAS) is to design, develop and implement a conceptually and methodologically robust study of changing social attitudes and values in South Africa. In meeting this objective, the HSRC is carefully and consistently monitoring and providing insight into changes in attitudes among various socio-demographic groupings. SASAS is intended to provide a unique long-term account of the social fabric of modern South Africa, and of how its changing political and institutional structures interact over time with changing social attitudes and values. The survey has been designed to yield a nationally representative sample of adults aged 16 and older, using the Human Sciences Research Council's (HSRC) Master Sample, which was designed in 2002 and consists of 1000 primary sampling units (PSUs). These PSUs were drawn, with probability proportional to size, from a pre-census 2001 list of 80780 enumerator areas (EAs). As the basis of the 2006 SASAS round of interviewing, a sub-sample of 500 EAs (PSUs) was drawn from the master sample. Three explicit stratification variables were used, namely province, geographic type and majority population group. The survey is conducted annually and the 2006 survey is the fourth wave in the series.
To accommodate the wide variety of topics included in the survey, two questionnaires are administered simultaneously. Apart from the standard set of demographic and background variables, each version of the questionnaire contained a harmonised core module. The questions contained in the core modules of the two SASAS questionnaires (demographics and core thematic issues) were asked of 7000 respondents, while the remaining rotating modules were asked of a half sample of approximately 3500 respondents each. The core module remains constant from round to round, with the aim of monitoring change and continuity in a variety of socio-economic and socio-political variables. In addition, a number of themes are accommodated in rotation. The rotating element of the survey consists of two or more topic-specific modules in each round of interviewing and is directed at measuring a range of policy and academic concerns and issues that require more detailed examination at a specific point in time than the multi-topic core module would permit. Topics included in the questionnaires are: democracy, national identity, public services, moral issues, crime, voting, demographics and other classificatory variables. Rotating modules are: media and communication, health status and behavior, social exclusion, tourism and leisure, intergroup relations, Soccer World Cup, work and welfare, social exclusion, democracy part 2, water services and poverty.
International Social Survey Programme (ISSP web page: www.issp.org/). The International Social Survey Programme (ISSP) is run by a group of research organisations, each of which undertakes to field annually an agreed module of questions on a chosen topic area. SASAS 2003 represents the formalisation of South Africa's inclusion in the ISSP, the intention being to include the module in one of the SASAS questionnaires in each round of interviewing.
Each module is chosen for repetition at intervals to allow comparisons both between countries (membership currently stands at 48) and over time. In 2006, the chosen subject was the role of government, and the module was carried in version two of the questionnaire (Qs. 174-229). This data can be accessed through the ISSP data portal (see link above).
We create a synthetic administrative dataset, to be used in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin), that mimics the properties of a real administrative dataset according to specifications by the ONS. Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1,033,664 individuals.
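The three perturbations described above (delete, move, duplicate) can be sketched as follows. This is only an illustration of the idea, not the project's actual procedure: the rates in `RATES`, the area codes, the record fields and the function name `perturb` are all hypothetical, since the ONS-specified proportions are not given in this description.

```python
import random

# Hypothetical per-group rates; the real ONS-specified proportions are not stated here.
RATES = {
    ("White", "male"): {"delete": 0.04, "move": 0.02, "duplicate": 0.01},
    ("White", "female"): {"delete": 0.03, "move": 0.02, "duplicate": 0.01},
    ("Non-white", "male"): {"delete": 0.08, "move": 0.03, "duplicate": 0.02},
    ("Non-white", "female"): {"delete": 0.06, "move": 0.03, "duplicate": 0.02},
}


def perturb(records, rates, areas, rng):
    """Return an 'administrative' copy of census records with coverage errors."""
    out = []
    for rec in records:
        r = rates[(rec["ethnic_group"], rec["gender"])]
        other_areas = [a for a in areas if a != rec["area"]]
        u = rng.random()
        if u < r["delete"]:
            continue  # record is missing from the administrative data
        if u < r["delete"] + r["move"]:
            # record appears only in a different geography
            out.append({**rec, "area": rng.choice(other_areas)})
        elif u < r["delete"] + r["move"] + r["duplicate"]:
            out.append(rec)  # original record ...
            out.append({**rec, "area": rng.choice(other_areas)})  # ... plus a copy elsewhere
        else:
            out.append(rec)  # unchanged
    return out


rng = random.Random(1991)
areas = ["E01", "E02", "E03"]  # hypothetical geography codes
census = [
    {
        "id": i,
        "ethnic_group": rng.choice(["White", "Non-white"]),
        "gender": rng.choice(["male", "female"]),
        "area": rng.choice(areas),
    }
    for i in range(10_000)
]
admin = perturb(census, RATES, areas, rng)
print(len(census), len(admin))
```

Because deletions shrink the file while duplications grow it, the administrative file can end up smaller or larger than the census extract, depending on the rates chosen.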
National Statistical Institutes (NSIs) are directing resources into advancing the use of administrative data in official statistics systems. This is a top priority for the UK Office for National Statistics (ONS) as they are undergoing transformations in their statistical systems to make more use of administrative data for future censuses and population statistics. Administrative data are defined as secondary data sources since they are produced by other agencies as a result of an event or a transaction relating to administrative procedures of organisations, public administrations and government agencies. Nevertheless, they have the potential to become important data sources for the production of official statistics by significantly reducing the cost and burden of response and improving the efficiency of such systems. Embedding administrative data in statistical systems is not without costs and it is vital to understand where potential errors may arise. The Total Administrative Data Error Framework sets out all possible sources of error when using administrative data as statistical data, depending on whether it is a single data source or integrated with other data sources such as survey data. For a single administrative data source, one of the main sources of error is coverage and representation of the target population of interest. This is particularly relevant when administrative data is delivered over time, such as tax data for maintaining the Business Register. For sub-project 1 of this research project, we develop quality indicators that allow the statistical agency to assess if the administrative data is representative of the target population and which sub-groups may be missing or over-covered. This is essential for producing unbiased estimates from administrative data.
Another priority at statistical agencies is to produce a statistical register for population characteristic estimates, such as employment statistics, from multiple sources of administrative and survey data. Using administrative data to build a spine, survey data can be integrated using record linkage and statistical matching approaches on a set of common matching variables. This will be the topic for sub-project 2, which will be split into several topics of research. The first topic is whether adding statistical predictions and correlation structures improves the linkage and data integration. The second topic is to research a mass imputation framework for imputing missing target variables in the statistical register where the missing data may be due to multiple underlying mechanisms. Therefore, the third topic will aim to improve the mass imputation framework to mitigate against possible measurement errors, for example by adding benchmarks and other constraints into the approaches. On completion of a statistical register, estimates for key target variables at local areas can easily be aggregated. However, it is essential to also measure the precision of these estimates through mean square errors and this will be the fourth topic of the sub-project. Finally, this new way of producing official statistics is compared to the more common method of incorporating administrative data through survey weights and model-based estimation approaches. In other words, we evaluate whether it is better 'to weight' or 'to impute' for population characteristic estimates - a key question under investigation by survey statisticians in the last decade.
Three evident and meaningful characteristics of disruptive technology are the zeroing effect, whereby its remarkable and unprecedented progress renders sustaining technology useless; the reshaping of the landscape of technology and economy; and its leading role in the future mainstream of the technology system, all of which have profound impacts and positive influences. The identification of disruptive technology is a universally difficult task. Therefore, the paper aims to enhance the technical relevance of potential disruptive technology identification results and improve the granularity and effectiveness of potential disruptive technology identification topics. Following life cycle theory, we divide the time span into stages, then construct and analyze the dynamics of technology networks to identify potential disruptive technology. The LDA topic model is then used to further clarify the topic content of potential disruptive technologies. This paper takes large civil UAVs as an example to prove the feas...
Through the analysis of the technology life cycle, the division of the patents, the construction of the technology network, the identification of leaping nodes, and the clustering of technical topics, we aim to identify potential disruptive technology.
Procedures:
Knowledge flow: becoming familiar with the technical background knowledge in the field of large civil UAVs, and accomplishing the technical decomposition.
Invention patents: analyzing the technology life cycle with Loglet Lab to separate the invention patents into four parts. For each part, constructing the IPC technical network and identifying the leapfrogging and diffusible nodes.
Technical topics: using the LDA model to cluster and explain the broad and varied content of the inventions.
Testing: dividing the inventions of the embryonic stage into two groups and examining them by means of the Mann-Whitney test. Finally, the result shows the huge differences in the patent value, sustaining influence, and c...
This README file was generated on 2023-11-25 by Mingli Ding.
GENERAL INFORMATION
Title of Dataset: technical network in the field of large civilian UAVs
Author Information
Investigators Contact Information
Name: Mingli Ding; Wangke Yu; Ran Li; Zhenzhen Wang; Jianing Li
Institution: Jingdezhen Ceramic University
Address: Jingdezhen, Jiangxi, China
Email:
A)patent (2005-2008).csv
B)patents (2009-2012).csv
C)patents (2013-2015).csv
D)patents (2016-2018).csv
E)technical network (2005-2008).csv
F)technical network (2009-2012).csv
G)technical networks (2013-2015).csv
H)technical network (2016-2018).csv
Number of variables: 2
Number of cases/rows: 234
Variable List:
4. Specialized fo...
Studies in the past have examined asthma prevalence and the associated risk factors in the United States using data from national surveys. However, the findings of these studies may not be relevant to specific states because environmental and socioeconomic factors vary across regions. The 2019 Behavioral Risk Factor Surveillance System (BRFSS) showed that Michigan had higher asthma prevalence rates than the national average. In this regard, we employ various modern machine learning techniques to predict asthma and identify risk factors associated with asthma among Michigan adults using the 2019 BRFSS data. After data cleaning, a sample of 10,337 individuals was selected for analysis, out of which 1,118 individuals (10.8%) reported having asthma during the survey period. Typical machine learning techniques often perform poorly due to imbalanced data issues. To address this challenge, we employed two synthetic data generation techniques, namely the Random Over-Sampling Examples (ROSE) and Synthetic Minority Over-Sampling Technique (SMOTE), and compared their performances. The overall performance of machine learning algorithms was improved using both methods, with ROSE performing better than SMOTE. Among the ROSE-adjusted models, we found that logistic regression, partial least squares, gradient boosting, LASSO, and elastic net had comparable performance, with sensitivity at around 50% and area under the curve (AUC) at around 63%. Due to ease of interpretability, logistic regression is chosen for further exploration of risk factors. Presence of chronic obstructive pulmonary disease, lower income, female sex, a cost-related barrier to seeing a doctor, having received a flu shot/spray in the past 12 months, the 18–24 age group, the Black, non-Hispanic group, and presence of diabetes are identified as asthma risk factors.
This study demonstrates the potential of machine learning coupled with imbalanced data modeling approaches for predicting asthma from a large survey dataset. We conclude that the findings could guide early screening of at-risk asthma patients and the design of appropriate interventions to improve care practices.
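To make the class imbalance concrete, the sketch below first reproduces the counts reported in the abstract and then shows a simplified version of the interpolation idea behind SMOTE: a synthetic minority case is placed on the segment between a minority case and one of its nearest minority neighbours. It is an illustration under stated assumptions, not the study's pipeline: the function name `smote_like`, the toy feature values and the choice of `k` are hypothetical, and ROSE (a smoothed bootstrap) is not shown.

```python
import math
import random

# Counts reported in the abstract.
n_total, n_asthma = 10_337, 1_118
n_other = n_total - n_asthma      # respondents without asthma
prevalence = n_asthma / n_total   # about 0.108
n_needed = n_other - n_asthma     # synthetic cases needed for a 50/50 balance


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy minority class with two numeric features (hypothetical values).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5), (1.8, 1.6)]
new_points = smote_like(minority, n_new=10, k=3, rng=random.Random(0))
print(n_other, n_needed, round(prevalence, 3), len(new_points))
```

Interpolation of this kind assumes numeric features; survey items that are categorical would need a variant designed for mixed data.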
https://www.icpsr.umich.edu/web/ICPSR/studies/4029/terms
The National Science Foundation (NSF) Surveys of Public Attitudes monitored the general public's attitudes toward and interest in science and technology. In addition, the survey assessed levels of literacy and understanding of scientific and environmental concepts and constructs, how scientific knowledge and information were acquired, attentiveness to public policy issues, and computer access and usage. Since 1979, the survey was administered at regular intervals (occurring every two or three years), producing 11 cross-sectional surveys through 2001. Data for Part 1 (Survey of Public Attitudes Multiple Wave Data) were comprised of the survey questionnaire items asked most often throughout the 22-year survey series and account for approximately 70 percent of the original questions asked. Data for Part 2, General Social Survey Subsample Data, combine the 1983-1999 Survey of Public Attitudes data with a subsample from the 2002 General Social Survey (GSS) (GENERAL SOCIAL SURVEYS, 1972-2002: [CUMULATIVE FILE] [ICPSR 3728]) and focus solely on levels of education and computer access and usage. Variables for Part 1 include the respondents' interest in new scientific or medical discoveries and inventions, space exploration, military and defense policies, whether they voted in a recent election, if they had ever contacted an elected or public official about topics regarding science, energy, defense, civil rights, foreign policy, or general economics, and how they felt about government spending on scientific research. Respondents were asked how they received information concerning science or news (e.g., via newspapers, magazines, or television), what types of television programming they watched, and what kind of magazines they read. Respondents were asked a series of questions to assess their understanding of scientific concepts like DNA, probability, and experimental methods. 
Respondents were also asked if they agreed with statements concerning science and technology and how they affect everyday living. Respondents were further asked a series of true and false questions regarding science-based statements (e.g., the center of the Earth is hot, all radioactivity is manmade, electrons are smaller than atoms, the Earth moves around the sun, humans and dinosaurs co-existed, and human beings developed from earlier species of animals). Variables for Part 2 include highest level of math attained in high school, whether the respondent had a postsecondary degree, field of highest degree, number of science-based college courses taken, major in college, household ownership of a computer, access to the World Wide Web, number of hours spent on a computer at home or at work, and topics searched for via the Internet. Demographic variables for Parts 1 and 2 include gender, race, age, marital status, number of people in household, level of education, and occupation.
The primary objective of SASAS is to design, develop and implement a conceptually and methodologically robust study of changing social attitudes and values in South Africa to be able to carefully and consistently monitor and explain changes in attitudes amongst various socio-demographic groupings. The SASAS explores a wide range of value changes, including the distribution and shape of racial attitudes and aspirations, attitudes towards democratic and constitutional issues, and the redistribution of resources and power. Moreover, there is also an explicit interest in mapping changing attitudes towards some of the moral issues that confront and are fiercely debated in South Africa, such as gender issues, AIDS, crime and punishment, governance, and service delivery. The SASAS is intended to provide a unique long-term account of the social fabric of modern South Africa, and of how its changing political and institutional structures interact over time with changing social attitudes and values.
National coverage
The units of analysis in the study are households and individuals
The population under investigation includes adults aged 16 and older in private households in South Africa
Sample survey data [ssd]
Sampling Design The South African Social Attitudes Survey has been designed to yield a representative sample of adults aged 16 and older. The sampling frame for the survey is the Human Sciences Research Council’s (HSRC) Master Sample, which was designed in 2002 and consists of 1 000 primary sampling units (PSUs). The 2001 population census enumerator areas (EAs) were used as PSUs. These PSUs were drawn, with probability proportional to size, from a pre-census 2001 list of EAs provided by Statistics South Africa.
The Master Sample excludes special institutions (such as hospitals, military camps, old age homes, school and university hostels), recreational areas, industrial areas and vacant EAs. It therefore focuses on dwelling units or visiting points as secondary sampling units, which have been defined as ‘separate (non-vacant) residential stands, addresses, structures, flats, homesteads, etc.’.
As the basis of the 2005 SASAS round of interviewing, a sub-sample of 500 PSUs was drawn from the HSRC’s Master Sample. Three explicit stratification variables were used, namely province, geographic type and majority population group.
Within each stratum, the allocated number of PSUs was drawn using proportional to size probability sampling. In each of these drawn PSUs, two clusters of 7 dwelling units each were drawn. These 14 dwelling units in each drawn PSU were systematically grouped into two subsamples of seven, to give the two SASAS samples.
Number of units: Questionnaire 1: 2 497 cases realised from 3 500 addresses; Questionnaire 2: 2 483 cases realised from 3 500 addresses; combined: 4 980 cases
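The sample sizes above follow directly from the design, and the realisation rates can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are those stated in the sampling description.

```python
# Design figures stated in the sampling description.
psus = 500
clusters_per_psu = 2
dwellings_per_cluster = 7

addresses = psus * clusters_per_psu * dwellings_per_cluster  # all drawn dwelling units
per_questionnaire = addresses // 2                           # one cluster of 7 per PSU each

realised = {"Questionnaire 1": 2497, "Questionnaire 2": 2483}
rates = {q: n / per_questionnaire for q, n in realised.items()}
combined = sum(realised.values())
combined_rate = combined / addresses

print(addresses, per_questionnaire, combined)
for q, r in rates.items():
    print(f"{q}: {r:.1%}")
print(f"Combined: {combined_rate:.1%}")
```

Roughly seven in ten drawn addresses yielded a completed interview for each questionnaire.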
Face-to-face [f2f]
To accommodate the wide variety of topics that was included in the 2005 survey, two questionnaires were administered simultaneously. Apart from the standard set of demographic and background variables, each version of the questionnaire contained a harmonised core module that remains constant from round to round, with the aim of monitoring change and continuity in a variety of socio-economic and socio-political variables. In addition, a number of themes are accommodated on a rotational basis. This rotating element of the survey consists of two or more topic-specific modules in each round of interviewing and is directed at measuring a range of policy and academic concerns and issues that require more detailed examination at a specific point in time than the multi-topic core module would permit.
Questions for the core module were asked of both samples (3 500 respondents each, 7 000 in total), of which 5 734 were realised.
The ISSP module: The International Social Survey Programme (ISSP) is run by a group of research organisations, each of which undertakes to field annually an agreed module of questions on a chosen topic area. SASAS 2003 represents the formalisation of South Africa's inclusion in the ISSP, the intention being to include the module in one of the SASAS questionnaires in each round of interviewing. Each module is chosen for repetition at intervals to allow comparisons both between countries (membership currently stands at 45) and over time. In 2005, the chosen subject was work orientation, and the module was carried in version 2 of the questionnaire (Qs.98-169).
The standard questionnaires dealt with democracy, identity, public services, social values, crime, voting, demographics, families and family authority.
The rotating modules in the 2005 survey covered:
Questionnaire 1: Poverty and social exclusion, family life
Questionnaire 2: ISSP module (work orientation), soccer World Cup, democracy part 2
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Suppose we observe a random vector X from some distribution in a known family with unknown parameters. We ask the following question: when is it possible to split X into two pieces f(X) and g(X) such that neither part is sufficient to reconstruct X by itself, but both together can recover X fully, and their joint distribution is tractable? One common solution to this problem when multiple samples of X are observed is data splitting, but Rasines and Young offer an alternative approach that uses additive Gaussian noise; this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this article, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on several prototypical applications, such as post-selection inference for trend filtering and other regression problems, and effect size estimation after interactive multiple testing. Supplementary materials for this article are available online.
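The additive Gaussian noise idea can be illustrated with a small simulation. This is a minimal sketch of one special case, assuming X ~ N(mu, sigma^2) with sigma known and Z an independent N(0, sigma^2) draw; it is not the article's general methodology. Under these assumptions f(X) = X + Z and g(X) = X - Z are uncorrelated (Cov = Var(X) - Var(Z) = 0) and therefore independent by joint normality, while (f + g) / 2 recovers X exactly. The function names and the values of `mu` and `sigma` are illustrative.

```python
import math
import random
import statistics


def fission_gaussian(x, sigma, rng):
    """Split x into (f, g) using independent noise Z ~ N(0, sigma^2)."""
    z = rng.gauss(0.0, sigma)
    return x + z, x - z


def recombine(f, g):
    """Recover x from the two pieces."""
    return (f + g) / 2.0


def correlation(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


rng = random.Random(0)
mu, sigma = 3.0, 2.0  # hypothetical parameter values
xs = [rng.gauss(mu, sigma) for _ in range(20_000)]
pairs = [fission_gaussian(x, sigma, rng) for x in xs]
fs = [f for f, _ in pairs]
gs = [g for _, g in pairs]

max_err = max(abs(recombine(f, g) - x) for (f, g), x in zip(pairs, xs))
corr_fg = correlation(fs, gs)  # close to 0: the two pieces carry independent noise
corr_fx = correlation(fs, xs)  # clearly positive: each piece is still informative about X
print(max_err, corr_fg, corr_fx)
```

In a post-selection setting, one piece would be used to choose a model and the other to carry out inference, much as the two halves of a split sample are used in data splitting.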