The Project for Statistics on Living Standards and Development was a countrywide World Bank Living Standards Measurement Survey. It covered approximately 9000 households, drawn from a representative sample of South African households. The fieldwork was undertaken during the nine months leading up to the country's first democratic elections at the end of April 1994. The purpose of the survey was to collect statistical information about the conditions under which South Africans live in order to provide policymakers with the data necessary for planning strategies. This data would aid the implementation of goals such as those outlined in the Government of National Unity's Reconstruction and Development Programme.
National coverage
All Household members.
Individuals in hospitals, old age homes, hotels and hostels of educational institutions were not included in the sample. Migrant labour hostels were included. In addition to those that turned up in the selected ESDs, a sample of three hostels was chosen from a national list provided by the Human Sciences Research Council and within each of these hostels a representative sample was drawn on a similar basis as described above for the households in ESDs.
Sample survey data [ssd]
Sample size is 9,000 households
The sample design adopted for the study was a two-stage self-weighting design in which the first-stage units were Census Enumerator Subdistricts (ESDs, or their equivalent) and the second-stage units were households.
The advantage of using such a design is that it provides a representative sample that need not be based on an accurate census population distribution. In the case of South Africa, the sample will automatically include many poor people, without the need to go beyond this and oversample the poor. Proportionate sampling, as in such a self-weighting sample design, offers the simplest possible data files for further analysis, as weights do not have to be added. However, in the end this advantage could not be retained and weights had to be added.
The sampling frame was drawn up on the basis of small, clearly demarcated area units, each with a population estimate. The nature of the self-weighting procedure adopted ensured that this population estimate was not important for determining the final sample, however. For most of the country, census ESDs were used. Where some ESDs comprised relatively large populations as for instance in some black townships such as Soweto, aerial photographs were used to divide the areas into blocks of approximately equal population size. In other instances, particularly in some of the former homelands, the area units were not ESDs but villages or village groups.
In the sample design chosen, the area stage units (generally ESDs) were selected with probability proportional to size, based on the census population. Systematic sampling was used throughout, that is, sampling at a fixed interval in a list of ESDs, starting at a randomly selected starting point. Given that sampling was self-weighting, the impact of stratification was expected to be modest. The main objective was to ensure that the racial and geographic breakdown approximated the national population distribution. This was done by listing the area stage units (ESDs) by statistical region and then, within the statistical region, by urban or rural. Within these sub-statistical regions, the ESDs were then listed in order of percentage African. The sampling interval for the selection of the ESDs was obtained by dividing the 1991 census population of 38,120,853 by the 300 clusters to be selected. This yielded 105,800. Starting at a randomly selected point, every 105,800th person down the cluster list was selected. This ensured both geographic and racial diversity (ESDs were ordered by statistical sub-region and proportion of the population African). In three or four instances, the ESD chosen was judged inaccessible and replaced with a similar one.
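The first-stage selection described above can be sketched as follows. This is a minimal illustration, not the survey team's actual program: the ESD records, the field names (`region`, `area`, `pct_african`, `pop`) and the seed are all hypothetical, and the interval is derived from the toy totals rather than from the census figures. Note that 38,120,853 divided by 300 is roughly 127,070, whereas an interval of 105,800 corresponds to about 360 clusters, so the sketch relies on neither figure.

```python
import random

def systematic_pps(units, n_clusters, seed=0):
    """Systematic selection with probability proportional to size.

    units: list of dicts with hypothetical keys id, region, area,
    pct_african and pop. A unit larger than the interval can be hit twice.
    """
    # Implicit stratification: order by region, then urban/rural,
    # then percentage African, as the text describes.
    ordered = sorted(units, key=lambda u: (u["region"], u["area"], u["pct_african"]))
    total = sum(u["pop"] for u in ordered)
    interval = total // n_clusters                     # persons per selection
    start = random.Random(seed).randrange(1, interval + 1)
    targets = [start + k * interval for k in range(n_clusters)]

    selected, cumulative, t = [], 0, 0
    for u in ordered:
        cumulative += u["pop"]
        # Select this unit once for every target person falling inside it.
        while t < n_clusters and targets[t] <= cumulative:
            selected.append(u["id"])
            t += 1
    return selected

# Toy sampling frame (made-up numbers, total population 8,000).
esds = [
    {"id": "A", "region": 1, "area": "rural", "pct_african": 90, "pop": 1200},
    {"id": "B", "region": 1, "area": "urban", "pct_african": 40, "pop": 800},
    {"id": "C", "region": 1, "area": "urban", "pct_african": 75, "pop": 3000},
    {"id": "D", "region": 2, "area": "rural", "pct_african": 95, "pop": 500},
    {"id": "E", "region": 2, "area": "urban", "pct_african": 60, "pop": 2500},
]
picks = systematic_pps(esds, n_clusters=4, seed=1)
```

Using integer person counts keeps the last target within the total population, so exactly `n_clusters` selections are always returned.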
In the second sampling stage the unit of analysis was the household. In each selected ESD a listing or enumeration of households was carried out by means of a field operation. From the households listed in an ESD a sample of households was selected by systematic sampling. Even though the ultimate enumeration unit was the household, in most cases "stands" were used as enumeration units. However, when a stand was chosen as the enumeration unit all households on that stand had to be interviewed.
Census population data, however, was available only for 1991. An assumption on population growth was thus made to obtain an approximation of the population size for 1993, the year of the survey. The sampling interval at the level of the household was determined in the following way: Based on the decision to have a take of 125 individuals on average per cluster (i.e. assuming 5 members per household to give an average cluster size of 25 households), the interval of households to be selected was determined as the census population divided by 118.1, i.e. allowing for population growth since the census. It was subsequently discovered that population growth was slightly over-estimated but this had little effect on the findings of the survey.
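The arithmetic in this paragraph can be checked with a short sketch. The constants come from the text; reading 125 / 118.1 as the assumed 1991 to 1993 growth factor is an inference, not something the source states, and the function names are hypothetical.

```python
import math

TAKE_PERSONS = 125              # target persons per cluster (from the text)
PERSONS_PER_HOUSEHOLD = 5       # assumed average household size (from the text)
DIVISOR = 118.1                 # growth-adjusted divisor (from the text)
FIRST_STAGE_INTERVAL = 105_800  # persons per selected ESD (from the text)

households_per_cluster = TAKE_PERSONS // PERSONS_PER_HOUSEHOLD
implied_growth = TAKE_PERSONS / DIVISOR   # inferred growth factor since 1991

def household_interval(census_pop_1991):
    """Households are sampled at every `interval`-th listing in the ESD."""
    return census_pop_1991 / DIVISOR

def overall_fraction(census_pop_1991):
    """Product of the two stage probabilities; the ESD size cancels out."""
    p_esd = census_pop_1991 / FIRST_STAGE_INTERVAL
    p_household = 1 / household_interval(census_pop_1991)
    return p_esd * p_household
```

For example, an ESD with a 1991 census population of 1,181 gets an interval of 10, so every tenth listed household is interviewed. Because the ESD population cancels in `overall_fraction`, every household has the same overall selection probability, which is what makes the design self-weighting.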
Face-to-face [f2f]
The main instrument used in the survey was a comprehensive household questionnaire. This questionnaire covered a wide range of topics but was not intended to provide exhaustive coverage of any single subject. In other words, it was an integrated questionnaire aimed at capturing different aspects of living standards. The topics covered included demography, household services, household expenditure, educational status and expenditure, remittances and marital maintenance, land access and use, employment and income, health status and expenditure and anthropometry (children under the age of six were weighed and their heights measured). This questionnaire was available to households in two languages, namely English and Afrikaans. In addition, interviewers had in their possession a translation in the dominant African language/s of the region.
In addition to the detailed household questionnaire referred to above, a community questionnaire was administered in each cluster of the sample. The purpose of this questionnaire was to elicit information on the facilities available to the community in each cluster. Questions related primarily to the provision of education, health and recreational facilities. Furthermore there was a detailed section for the prices of a range of commodities from two retail sources in or near the cluster: a formal source such as a supermarket and a less formal one such as the "corner cafe" or a "spaza". The purpose of this latter section was to obtain a measure of regional price variation both by region and by retail source. These prices were obtained by the interviewer. For the questions relating to the provision of facilities, respondents were "prominent" members of the community such as school principals, priests and chiefs.
All the questionnaires were checked when received. Where information was incomplete or appeared contradictory, the questionnaire was sent back to the relevant survey organization. As soon as the data was available, it was captured using the local development platform ADE. This was completed in February 1994. Following this, a series of exploratory programs was written to highlight inconsistencies and outliers. For example, all person-level files were linked together to ensure that the same person code reported in different sections of the questionnaire corresponded to the same person. The error reports from these programs were compared to the questionnaires and the necessary alterations made. This was a lengthy process, as several files were checked more than once, and it was completed at the beginning of August 1994. In some cases questionnaires would contain missing values, or comments that the respondent did not know, or refused to answer a question.
These responses are coded in the data files with the following values:
| Value | Meaning |
|---|---|
| -1 | The data was not available on the questionnaire or form |
| -2 | The field is not applicable |
| -3 | Respondent refused to answer |
| -4 | Respondent did not know answer to question |
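When analysing the files, these codes need to be separated from real values before computing any statistics. The sketch below is one possible way to do that in plain Python; the function name and the short reason labels are hypothetical, and it assumes the variable being cleaned cannot legitimately take the values -1 to -4.

```python
# Special codes documented for the data files (labels shortened here).
MISSING_CODES = {
    -1: "not available",
    -2: "not applicable",
    -3: "refused",
    -4: "did not know",
}

def split_missing(values):
    """Return (clean, reasons): coded entries become None, with the reason kept."""
    clean, reasons = [], []
    for v in values:
        if v in MISSING_CODES:
            clean.append(None)
            reasons.append(MISSING_CODES[v])
        else:
            clean.append(v)
            reasons.append(None)
    return clean, reasons

# Toy column of reported ages with two coded non-responses.
ages, why = split_missing([34, -3, 7, -1])
valid = [a for a in ages if a is not None]
mean_age = sum(valid) / len(valid)
```

Keeping the reason alongside the cleaned value lets refusals and "did not know" answers be analysed separately rather than lumped together as missing.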
The data collected in clusters 217 and 218 should be viewed as highly unreliable and therefore removed from the data set. The data currently available on the web site has been revised to remove the data from these clusters. Researchers who have downloaded the data in the past should revise their data sets. For information on the data in those clusters, contact SALDRU http://www.saldru.uct.ac.za/.
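For researchers working from an older download, the affected records can be filtered out with something like the sketch below. The key name `cluster` is an assumption; the actual name of the cluster identifier in the data files should be checked against the documentation.

```python
UNRELIABLE_CLUSTERS = {217, 218}  # cluster numbers named in the text

def drop_unreliable(records, cluster_key="cluster"):
    """Keep only records whose cluster is not flagged as unreliable."""
    return [r for r in records if r[cluster_key] not in UNRELIABLE_CLUSTERS]

# Toy records with a hypothetical household identifier.
rows = [{"cluster": 216, "hh": 1}, {"cluster": 217, "hh": 2}, {"cluster": 218, "hh": 3}]
kept = drop_unreliable(rows)
```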
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Students in statistics, data science, analytics, and related fields study the theory and methodology of data-related topics. Some, but not all, are exposed to experiential learning courses that cover essential parts of the life cycle of practical problem-solving. Experiential learning enables students to convert real-world issues into solvable technical questions and effectively communicate their findings to clients. We describe several experiential learning course designs in statistics, data science, and analytics curricula. We present findings from interviews with faculty from the U.S., Europe, and the Middle East and surveys of former students. We observe that courses featuring live projects and coaching by experienced faculty have a high career impact, as reported by former participants. However, such courses are labor-intensive for both instructors and students. We give estimates of the required effort to deliver courses with live projects and the perceived benefits and tradeoffs of such courses. Overall, we conclude that courses offering live-project experiences, despite being more time-consuming than traditional courses, offer significant benefits for students regarding career impact and skill development, making them worthwhile investments. Supplementary materials for this article are available online.
https://creativecommons.org/publicdomain/zero/1.0/
This extensive dataset comprises approximately 50,000 academic papers along with their corresponding metadata, designed to facilitate various natural language processing (NLP) tasks such as classification and retrieval. The dataset covers a diverse range of research domains, including but not limited to computer science, biology, social sciences, engineering, and more. The list of all categories can be found here. With its comprehensive collection of academic papers and enriched metadata, this dataset serves as a valuable resource for researchers and data enthusiasts interested in advancing NLP applications in the academic domain.
Metadata: The dataset includes essential metadata for each paper, such as the publish date, title, summary/abstract, author(s), and category. The metadata is meticulously curated to ensure accuracy and consistency, enabling researchers to swiftly extract valuable insights and conduct exploratory data analysis.
Vast Paper Collection: With nearly 50,000 academic papers, this dataset encompasses a broad spectrum of research topics and domains, making it suitable for a wide range of NLP tasks, including but not limited to document classification, topic modeling, and document retrieval.
Application Flexibility: The dataset is meticulously preprocessed and annotated, making it adaptable for various NLP applications. Researchers and practitioners can use it for tasks like sentiment analysis, keyword extraction, and more.
Document Classification: Leverage this dataset to build powerful classifiers capable of categorizing academic papers into relevant research domains or topics. This can aid in automated content organization and information retrieval.
Document Retrieval: Develop efficient retrieval models that can quickly identify and retrieve relevant papers based on user queries or specific keywords. Such models can streamline the research process and assist researchers in finding relevant literature faster.
Topic Modeling: Use this dataset to perform topic modeling and extract meaningful topics or themes present within the academic papers. This can provide valuable insights into the prevailing research trends and interests within different disciplines.
Recommendation Systems: Employ the dataset to build personalized recommendation systems that suggest relevant papers to researchers based on their previous interests or research focus.
We would like to express our gratitude to the authors and publishers of the academic papers included in this dataset for their valuable contributions to the research community. By making this dataset publicly available, we hope to foster advancements in natural language processing and support data-driven research across diverse domains.
As the curators of this dataset, we have made every effort to ensure the accuracy and quality of the data. However, we cannot guarantee the absolute correctness of the information or the suitability of the dataset for any specific purpose. Users are encouraged to exercise their judgment and discretion while utilizing the dataset for their research projects.
We sincerely hope that this dataset proves to be a valuable resource for the NLP community and contributes to the development of innovative solutions in academic research and beyond. Happy analyzing and modeling!
Underserved communities, especially those in coastal areas in Puerto Rico, face significant threats from natural hazards such as hurricanes and rising sea levels. Limited funding hinders the investment in costly mitigation measures, increasing exposure to natural disasters. Providing coastal resources and data products through effective communication mechanisms is fundamental to improving the well-being of these underserved coastal communities. The overall objectives of the pilot effort to engage and connect with underrepresented coastal communities in Puerto Rico were the following: (1) compile a comprehensive database of the projects and resources relevant to natural hazards in Puerto Rico; (2) foster connections with Puerto Rican interested parties to better understand their priorities regarding coastal hazards and provide them with pertinent U.S. Geological Survey (USGS) resources; and (3) identify knowledge gaps to guide future USGS projects in Puerto Rico. To address these objectives, the research team held a virtual internal meeting amongst USGS colleagues (organized with a professional facilitator) to identify and gather information on existing USGS data, knowledge, and tools available for natural hazards and resources in Puerto Rico. The goals of the meeting were to: (1) exchange knowledge among colleagues, (2) broaden the network of participants, (3) foster potential collaborative relationships with researchers engaged in USGS hazards projects in Puerto Rico, and (4) document all the research taking place in Puerto Rico related to natural hazards and resources. The result was a database of USGS natural hazards projects being conducted or recently completed in Puerto Rico. For further information about this data, refer to the associated journal article (Torres-García and others, 2024).
https://creativecommons.org/publicdomain/zero/1.0/
Demographic Analysis of Shopping Behavior: Insights and Recommendations
Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.
Cleaned Data Details: Data cleaned and standardized, 15,079 unique entries with attributes including - Customer ID, age, gender, annual income, and spending score. Can be used by marketing analysts to produce a better strategy for mall specific marketing.
Challenges Faced:
1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention.
2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort.
3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.
Research Topics:
1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions.
2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.
Suggestions for Project Expansion:
1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights.
2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis.
3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking.
This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.
References
OpenAI. (2022). ChatGPT [Computer software]. Retrieved from https://openai.com/chatgpt
Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. Retrieved from https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data
Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. Retrieved from https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook
Pandas-Datareader. (n.d.). Retrieved from https://pypi.org/project/pandas-datareader/
Topic Modeling for Research Articles
Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become more difficult. Tagging or topic modelling provides a way to give each research article identifying tokens, which facilitates the recommendation and search process.
Given the abstract and title for a set of research articles, predict the topics for each article included in the test set.
Note that a research article can possibly have more than 1 topic. The research article abstracts and titles are sourced from the following 6 topics:
Computer Science
Physics
Mathematics
Statistics
Quantitative Biology
Quantitative Finance
| Column | Description |
|---|---|
| ID | Unique ID for each article |
| TITLE | Title of the research article |
| ABSTRACT | Abstract of the research article |
| Computer Science | Whether article belongs to topic computer science (1/0) |
| Physics | Whether article belongs to topic physics (1/0) |
| Mathematics | Whether article belongs to topic Mathematics (1/0) |
| Statistics | Whether article belongs to topic Statistics (1/0) |
| Quantitative Biology | Whether article belongs to topic Quantitative Biology (1/0) |
| Quantitative Finance | Whether article belongs to topic Quantitative Finance (1/0) |

| Column | Description |
|---|---|
| ID | Unique ID for each article |
| TITLE | Title of the research article |
| ABSTRACT | Abstract of the research article |

| Column | Description |
|---|---|
| ID | Unique ID for each article |
| TITLE | Title of the research article |
| ABSTRACT | Abstract of the research article |
| Computer Science | Whether article belongs to topic computer science (1/0) |
| Physics | Whether article belongs to topic physics (1/0) |
| Mathematics | Whether article belongs to topic Mathematics (1/0) |
| Statistics | Whether article belongs to topic Statistics (1/0) |
| Quantitative Biology | Whether article belongs to topic Quantitative Biology (1/0) |
| Quantitative Finance | Whether article belongs to topic Quantitative Finance (1/0) |
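
Because an article can carry more than one topic, the six 1/0 columns are best read as a multi-label target rather than a single class. The sketch below shows one way to load rows in this layout with the standard library; the inline sample row is made up, and joining the title and abstract into one text field is an illustrative choice, not part of the dataset.

```python
import csv
import io

TOPICS = [
    "Computer Science", "Physics", "Mathematics",
    "Statistics", "Quantitative Biology", "Quantitative Finance",
]

def read_articles(handle):
    """Parse rows in the documented column layout into text plus a label list."""
    articles = []
    for row in csv.DictReader(handle):
        labels = [t for t in TOPICS if row[t] == "1"]   # multi-label: zero or more topics
        articles.append({
            "id": row["ID"],
            "text": row["TITLE"] + ". " + row["ABSTRACT"],
            "labels": labels,
        })
    return articles

# Made-up sample standing in for the training file.
sample = io.StringIO(
    "ID,TITLE,ABSTRACT," + ",".join(TOPICS) + "\n"
    "1,Toy title,Toy abstract,1,0,0,1,0,0\n"
)
articles = read_articles(sample)
```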
Financial overview and grant giving statistics of Metro Ideas Project
https://data.gov.tw/license
According to the regulations for the operation of the Ministry of Economic Affairs' commissioned research projects in the field of science and technology, the themes and key points of the company's commissioned research for the year are announced for the participation of the industry and academia.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The challenge of detecting research topics in a specific research field has attracted attention from researchers in the bibliometrics community. In this study, to solve two problems of clustering papers, i.e., the influence of different distributions of citation links and involved textual features on similarity computation, the authors propose a hybrid self-optimized clustering model to detect research topics by extending the hybrid clustering model to identify “core documents”. First, the Amsler network, consisting of bibliographic coupling and co-citation links, is created to calculate the citation-based similarity based on the cosine angle of papers. Second, the cosine similarity is also used to compute the text-based similarity, which consists of the textual statistical and topological features. Then, the cosine angle of the linear combination of citation- and text-based similarity is considered as the hybrid similarity. Finally, the Louvain method is applied to cluster papers, and the terms based on term frequency are used to label clusters. To test the performance of the proposed model, a dataset related to the data envelopment analysis field is used for comparison and analysis of clustering results. Based on the benchmark built, different clustering methods with different citation links or textual features are compared according to evaluation measures. The results show that the proposed model can obtain reasonable and effective clustering results, and the research topics of data envelopment analysis field are also analyzed based on the proposed model. As different features are considered in the proposed model compared with previous hybrid clustering models, the proposed clustering model can provide inspiration for further studies on topic identification by other researchers.
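The idea of combining a citation-based and a text-based similarity can be illustrated with a small sketch. This is a simplified weighted sum and not the authors' exact formulation: the weight `lam`, the toy vectors and the function names are assumptions, and the Louvain clustering step is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hybrid_similarity(cit_u, cit_v, txt_u, txt_v, lam=0.5):
    """Linear combination of citation-based and text-based cosine similarity."""
    return lam * cosine(cit_u, cit_v) + (1 - lam) * cosine(txt_u, txt_v)

# Toy papers: identical citation-link profiles, disjoint vocabularies.
paper_a = {"cit": [1, 0, 0], "txt": [0, 1]}
paper_b = {"cit": [1, 0, 0], "txt": [1, 0]}
score = hybrid_similarity(paper_a["cit"], paper_b["cit"], paper_a["txt"], paper_b["txt"])
```

A matrix of such pairwise scores could then serve as edge weights for a community-detection method such as Louvain.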
Purpose: This map contains project data for the Arches recreational hot spot study, PIN 16097, for the Arches Hotspot Preliminary Project Ideas App 2018 study and is embedded within that storymap. It illustrates proposed parking, cycling trail, and other recreational transportation projects. The data was completed in 2018 by Jones and DeMille Engineers. For questions on the data, please contact Adam Perschon at adam.p@jonesanddemille.com. Ownership was transferred from Paul Damron to Bracken on 6/23/23.
Go Live Date: January 2018
Project PIN: 16097
ePM Project Name: Moab Area Recreational Hot Spot Study
Owner: Bracken Davis (bdavis1@utah.gov)
Update Interval: One-time creation.
Data Location: MoabHotspotStudy hosted feature layer.
Associated Apps: Arches Hotspot Preliminary Project storymap; UDOT Region 4 - Arches Hotspot Improvement Projects 2018 storymap; UDOT Region 4 - Arches Hotspot Additional Study Information 2018 storymap
Expected Life of Data: There is no foreseeable end date for this data.
This dataset contains a curated collection of 1300+ Final Year Projects (FYPs) gathered from multiple sources.
Each entry includes detailed information such as project title, abstract, domain, technologies used, year of development, and the original source URL.
It is designed to support students, researchers, and educators in exploring project ideas across a wide variety of domains and technologies.
The dataset is organized into the following columns:
| title | abstract | domain | technologies | year | source_url |
|---|---|---|---|---|---|
| AI-Powered Smart Home Energy Optimizer | An IoT system that uses machine learning to analyze energy consumption patterns and automatically control appliances. | IoT | Python, TensorFlow, Raspberry Pi, MQTT, Sensors | 2023 | https://example.com |
License: For academic and research use only.
http://data.europa.eu/eli/dec/2011/833/oj
This dataset contains information about projects and their results funded by the European Union under the Horizon 2020 framework programme for research and innovation from 2014 to 2020.
The dataset is composed of six (6) different subsets (in different formats):
Reference data (programmes, topics, topic keywords, funding schemes (types of action), organisation types and countries) can be found in this dataset: https://data.europa.eu/euodp/en/data/dataset/cordisref-data
EuroSciVoc is available here: https://data.europa.eu/data/datasets/euroscivoc-the-european-science-vocabulary
CORDIS datasets are produced monthly. Therefore, inconsistencies may occur between what is presented on the CORDIS live website and the datasets.
Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for the investigator who makes his or her data available. We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression. This correlation between publicly available data and increased literature impact may further motivate investigators to share their detailed research data.
Academy of Program/Project & Engineering Leadership's Ask the Academy magazine past issues.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics of baseline covariates and standardized bias before and after propensity score (PS) adjustment using weighting by the odds in 20% of the total respondents, a cross-sectional study in five cities, China, 2007–2008 (n = 3,179). (PDF)
Academy of Program/Project & Engineering Leadership's ASK Magazine archive.
We create a synthetic administrative dataset to be used in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin) that mimics the properties of a real administrative dataset according to specifications by the ONS. Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1,033,664 individuals.
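The perturbation procedure described above could look roughly like the sketch below. It is a sketch under stated assumptions, not the project's code: the record fields, the rate table, and the choice to treat deletion, moving and duplication as mutually exclusive events per record are all hypothetical, and the real proportions are not given in the description.

```python
import random

def make_admin(census, rates, geographies, seed=0):
    """Delete, move or duplicate records with group-specific probabilities.

    rates maps (ethnic_group, sex) to (p_delete, p_move, p_duplicate);
    at least two geographies are needed so a record can be moved.
    """
    rng = random.Random(seed)
    admin = []
    for rec in census:
        p_del, p_move, p_dup = rates[(rec["ethnic"], rec["sex"])]
        draw = rng.random()
        if draw < p_del:
            continue                                   # under-coverage: record dropped
        elsewhere = [g for g in geographies if g != rec["geo"]]
        if draw < p_del + p_move:
            admin.append({**rec, "geo": rng.choice(elsewhere)})   # wrong location
        elif draw < p_del + p_move + p_dup:
            admin.append(dict(rec))                    # original kept ...
            admin.append({**rec, "geo": rng.choice(elsewhere)})   # ... plus a duplicate elsewhere
        else:
            admin.append(dict(rec))                    # record left unchanged
    return admin

# Toy census of four people and illustrative (made-up) rates.
census = [
    {"id": i, "ethnic": e, "sex": s, "geo": "G1"}
    for i, (e, s) in enumerate([("White", "M"), ("White", "F"),
                                ("Non-white", "M"), ("Non-white", "F")])
]
groups = [("White", "M"), ("White", "F"), ("Non-white", "M"), ("Non-white", "F")]
unchanged = make_admin(census, {g: (0.0, 0.0, 0.0) for g in groups}, ["G1", "G2"])
all_duplicated = make_admin(census, {g: (0.0, 0.0, 1.0) for g in groups}, ["G1", "G2"])
all_deleted = make_admin(census, {g: (1.0, 0.0, 0.0) for g in groups}, ["G1", "G2"])
```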
National Statistical Institutes (NSIs) are directing resources into advancing the use of administrative data in official statistics systems. This is a top priority for the UK Office for National Statistics (ONS) as they are undergoing transformations in their statistical systems to make more use of administrative data for future censuses and population statistics. Administrative data are defined as secondary data sources since they are produced by other agencies as a result of an event or a transaction relating to administrative procedures of organisations, public administrations and government agencies. Nevertheless, they have the potential to become important data sources for the production of official statistics by significantly reducing the cost and burden of response and improving the efficiency of such systems. Embedding administrative data in statistical systems is not without costs and it is vital to understand where potential errors may arise. The Total Administrative Data Error Framework sets out all possible sources of error when using administrative data as statistical data, depending on whether it is a single data source or integrated with other data sources such as survey data. For a single administrative data source, one of the main sources of error is coverage and representation of the target population of interest. This is particularly relevant when administrative data is delivered over time, such as tax data for maintaining the Business Register. For sub-project 1 of this research project, we develop quality indicators that allow the statistical agency to assess if the administrative data is representative of the target population and which sub-groups may be missing or over-covered. This is essential for producing unbiased estimates from administrative data.
Another priority at statistical agencies is to produce a statistical register for population characteristic estimates, such as employment statistics, from multiple sources of administrative and survey data. Using administrative data to build a spine, survey data can be integrated using record linkage and statistical matching approaches on a set of common matching variables. This will be the topic for sub-project 2, which will be split into several topics of research. The first topic is whether adding statistical predictions and correlation structures improves the linkage and data integration. The second topic is to research a mass imputation framework for imputing missing target variables in the statistical register where the missing data may be due to multiple underlying mechanisms. Therefore, the third topic will aim to improve the mass imputation framework to mitigate against possible measurement errors, for example by adding benchmarks and other constraints into the approaches. On completion of a statistical register, estimates for key target variables at local areas can easily be aggregated. However, it is essential to also measure the precision of these estimates through mean square errors and this will be the fourth topic of the sub-project. Finally, this new way of producing official statistics is compared to the more common method of incorporating administrative data through survey weights and model-based estimation approaches. In other words, we evaluate whether it is better 'to weight' or 'to impute' for population characteristic estimates - a key question under investigation by survey statisticians in the last decade.
The global number of Facebook users was forecast to increase continuously between 2023 and 2027 by a total of 391 million users (+14.36 percent). After four consecutive years of growth, the Facebook user base is estimated to reach 3.1 billion users, and therefore a new peak, in 2027. Notably, the number of Facebook users has increased continuously over the past years. User figures, shown here for the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts by persons only once. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
Healthcare Cost and Utilization Project (HCUP) Fast Stats provides easy access to the latest HCUP-based statistics for health care information topics. HCUP Fast Stats uses visual statistical displays in stand-alone graphs, trend figures, or simple tables to convey complex information at a glance. Fast Stats is updated regularly for timely, topic-specific national and State-level statistics. Fast Stats topics and graphics cover hospital stays and emergency department visits, including information at the national and State levels, trends over time, and selected priority topics such as: State Trends in Hospital Use by Payer; National Hospital Utilization and Costs; Hurricane Impact on Hospital Use; Opioids & Neonatal Abstinence Syndrome; Severe Maternal Morbidity.
The Participation Survey started in October 2021 and is the key evidence source on engagement for DCMS. It is a continuous push-to-web household survey of adults aged 16 and over in England.
The Participation Survey provides nationally representative estimates of physical and digital engagement with the arts, heritage, museums & galleries, libraries and archives, as well as engagement with tourism, major events, live sports and digital.
The Participation Survey is only asked of adults in England. Currently there is no harmonised survey or set of questions within the administrations of the UK. Data on participation in cultural sectors for the devolved administrations is available in the Scottish Household Survey (https://www.gov.scot/collections/scottish-household-survey/), the National Survey for Wales (https://gov.wales/national-survey-wales) and the Northern Ireland Continuous Household Survey (https://www.communities-ni.gov.uk/topics/statistics-and-research/culture-and-heritage-statistics).
The pre-release access document above contains a list of ministers and officials who have received privileged early access to this release of Participation Survey data. In line with best practice, the list has been kept to a minimum and those given access for briefing purposes had a maximum of 24 hours. Details on the pre-release access arrangements for this dataset are available in the accompanying material.
Our statistical practice is regulated by the OSR. OSR sets the standards of trustworthiness, quality and value in the Code of Practice for Statistics (https://code.statisticsauthority.gov.uk/the-code/) that all producers of official statistics should adhere to.
You are welcome to contact us directly with any comments about how we meet these standards by emailing evidence@dcms.gov.uk. Alternatively, you can contact OSR by emailing regulation@statistics.gov.uk or via the OSR website.
The responsible statistician for this release is Donilia Asgill. For enquiries on this release, contact participationsurvey@dcms.gov.uk.
The Project for Statistics on Living Standards and Development was a countrywide World Bank Living Standards Measurement Survey. It covered approximately 9000 households, drawn from a representative sample of South African households. The fieldwork was undertaken during the nine months leading up to the country's first democratic elections at the end of April 1994. The purpose of the survey was to collect statistical information about the conditions under which South Africans live in order to provide policymakers with the data necessary for planning strategies. This data would aid the implementation of goals such as those outlined in the Government of National Unity's Reconstruction and Development Programme.
National coverage
All Household members.
Individuals in hospitals, old age homes, hotels and hostels of educational institutions were not included in the sample. Migrant labour hostels were included. In addition to those that turned up in the selected ESDs, a sample of three hostels was chosen from a national list provided by the Human Sciences Research Council and within each of these hostels a representative sample was drawn on a similar basis as described above for the households in ESDs.
Sample survey data [ssd]
Sample size is 9,000 households
The sample design adopted for the study was a two-stage self-weighting design in which the first-stage units were Census Enumerator Subdistricts (ESDs, or their equivalent) and the second-stage units were households.
The advantage of using such a design is that it provides a representative sample that need not be based on an accurate census population distribution. In the case of South Africa, the sample will automatically include many poor people, without the need to go beyond this and oversample the poor. Proportionate sampling, as in such a self-weighting sample design, offers the simplest possible data files for further analysis, as weights do not have to be added. However, in the end this advantage could not be retained and weights had to be added.
The sampling frame was drawn up on the basis of small, clearly demarcated area units, each with a population estimate. The nature of the self-weighting procedure adopted ensured that this population estimate was not important for determining the final sample, however. For most of the country, census ESDs were used. Where some ESDs comprised relatively large populations as for instance in some black townships such as Soweto, aerial photographs were used to divide the areas into blocks of approximately equal population size. In other instances, particularly in some of the former homelands, the area units were not ESDs but villages or village groups.
In the sample design chosen, the area stage units (generally ESDs) were selected with probability proportional to size, based on the census population. Systematic sampling was used throughout, that is, sampling at a fixed interval in a list of ESDs, starting at a randomly selected point. Given that sampling was self-weighting, the impact of stratification was expected to be modest. The main objective was to ensure that the racial and geographic breakdown approximated the national population distribution. This was done by listing the area stage units (ESDs) by statistical region and then, within each statistical region, by urban or rural. Within these sub-statistical regions, the ESDs were then listed in order of percentage African. The sampling interval for the selection of the ESDs was obtained by dividing the 1991 census population of 38,120,853 by the 300 clusters to be selected. This yielded 105,800. Starting at a randomly selected point, every 105,800th person down the cluster list was selected. This ensured both geographic and racial diversity (ESDs were ordered by statistical sub-region and proportion of the population African). In three or four instances, the ESD chosen was judged inaccessible and replaced with a similar one.
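The first-stage selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the survey team's actual program: the function name `pps_systematic_sample` and the list `esd_sizes` are hypothetical, and the sizes are made-up values. The sketch also computes two quotients from the figures quoted in the text, which do not reconcile exactly: 38,120,853 divided by 300 is about 127,070, whereas an interval of 105,800 corresponds to roughly 360 clusters. The sketch simply takes the interval as total population divided by the number of clusters and makes no claim about which figure the survey used.

```python
import random
from itertools import accumulate


def pps_systematic_sample(sizes, n_clusters, rng=None):
    """Pick area units with probability proportional to size (PPS),
    using systematic sampling down a cumulative population list.

    sizes      -- population of each unit, in the sorted (stratified) list order
    n_clusters -- number of selections to make
    Returns the indices of the selected units. A unit larger than the
    interval can be selected more than once.
    """
    rng = rng or random.Random()
    total = sum(sizes)
    interval = total / n_clusters          # fixed sampling interval, in persons
    start = rng.random() * interval        # random start within the first interval
    cumulative = list(accumulate(sizes))   # running population total down the list

    selected, unit = [], 0
    for k in range(n_clusters):
        target = start + k * interval      # the k-th "person" to hit
        # Advance to the unit whose cumulative range contains the target.
        while unit < len(sizes) - 1 and cumulative[unit] <= target:
            unit += 1
        selected.append(unit)
    return selected


# Arithmetic on the figures quoted in the text (no interpretation implied).
CENSUS_1991 = 38_120_853
interval_for_300 = CENSUS_1991 / 300         # about 127,070
clusters_for_105800 = CENSUS_1991 / 105_800  # about 360.3

# Hypothetical unit sizes, for illustration only.
esd_sizes = [1000, 3000, 500, 5500]
picked = pps_systematic_sample(esd_sizes, 4, random.Random(1))
print(picked)
```

Because the list is walked in its sorted order, the selections are spread across regions and across the percentage-African ordering, which is the implicit stratification the text describes.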
In the second sampling stage the unit of analysis was the household. In each selected ESD a listing or enumeration of households was carried out by means of a field operation. From the households listed in an ESD a sample of households was selected by systematic sampling. Even though the ultimate enumeration unit was the household, in most cases "stands" were used as enumeration units. However, when a stand was chosen as the enumeration unit all households on that stand had to be interviewed.
Census population data, however, was available only for 1991. An assumption on population growth was thus made to obtain an approximation of the population size for 1993, the year of the survey. The sampling interval at the level of the household was determined in the following way: Based on the decision to have a take of 125 individuals on average per cluster (i.e. assuming 5 members per household to give an average cluster size of 25 households), the interval of households to be selected was determined as the census population divided by 118.1, i.e. allowing for population growth since the census. It was subsequently discovered that population growth was slightly over-estimated but this had little effect on the findings of the survey.
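The household-stage arithmetic can be checked with a short Python sketch. It rests on one reading of the text that the source does not spell out: the household interval for an ESD is its 1991 census population divided by 118.1, and the gap between 118.1 and the target of 125 persons reflects assumed population growth (125 / 118.1, about 5.8 percent). All function names are hypothetical, and the first-stage interval of 105,800 is simply the figure quoted in the text.

```python
import math

TARGET_PERSONS = 125   # average take per cluster, from the text
HOUSEHOLD_SIZE = 5     # assumed members per household, from the text
DIVISOR = 118.1        # census population is divided by this, from the text

# Growth factor implied by the two figures (an inference, not a quoted value).
implied_growth = TARGET_PERSONS / DIVISOR


def household_interval(census_population):
    """Select every k-th listed household, where k is this interval."""
    return census_population / DIVISOR


def expected_households(census_population):
    """Expected number of sampled households in an ESD of the given 1991 size."""
    current_population = census_population * implied_growth
    households = current_population / HOUSEHOLD_SIZE
    return households / household_interval(census_population)


def overall_probability(census_population, first_stage_interval):
    """Selection probability of a household across both stages."""
    p_esd = census_population / first_stage_interval        # PPS first stage
    p_household = 1 / household_interval(census_population)  # systematic second stage
    return p_esd * p_household


# The ESD population cancels out, so every household has the same chance:
small = overall_probability(800, 105_800)
large = overall_probability(2400, 105_800)
print(round(implied_growth, 4), expected_households(1181), small, large)
```

Under these assumptions each cluster yields about 25 households whatever its size, and the overall selection probability is the same for every household, which is what makes the design self-weighting.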
Individuals in hospitals, old age homes, hotels and hostels of educational institutions were not included in the sample. Migrant labour hostels were included. In addition to those that turned up in the selected ESDs, a sample of three hostels was chosen from a national list provided by the Human Sciences Research Council and within each of these hostels a representative sample was drawn on a similar basis as described above for the households in ESDs.
Face-to-face [f2f]
The main instrument used in the survey was a comprehensive household questionnaire. This questionnaire covered a wide range of topics but was not intended to provide exhaustive coverage of any single subject. In other words, it was an integrated questionnaire aimed at capturing different aspects of living standards. The topics covered included demography, household services, household expenditure, educational status and expenditure, remittances and marital maintenance, land access and use, employment and income, health status and expenditure and anthropometry (children under the age of six were weighed and their heights measured). This questionnaire was available to households in two languages, namely English and Afrikaans. In addition, interviewers had in their possession a translation in the dominant African language/s of the region.
In addition to the detailed household questionnaire referred to above, a community questionnaire was administered in each cluster of the sample. The purpose of this questionnaire was to elicit information on the facilities available to the community in each cluster. Questions related primarily to the provision of education, health and recreational facilities. Furthermore there was a detailed section for the prices of a range of commodities from two retail sources in or near the cluster: a formal source such as a supermarket and a less formal one such as the "corner cafe" or a "spaza". The purpose of this latter section was to obtain a measure of regional price variation both by region and by retail source. These prices were obtained by the interviewer. For the questions relating to the provision of facilities, respondents were "prominent" members of the community such as school principals, priests and chiefs.
All the questionnaires were checked when received. Where information was incomplete or appeared contradictory, the questionnaire was sent back to the relevant survey organization. As soon as the data was available, it was captured using local development platform ADE. This was completed in February 1994. Following this, a series of exploratory programs was written to highlight inconsistencies and outliers. For example, all person-level files were linked together to ensure that the same person code reported in different sections of the questionnaire corresponded to the same person. The error reports from these programs were compared to the questionnaires and the necessary alterations made. This was a lengthy process, as several files were checked more than once, and it was completed at the beginning of August 1994. In some cases questionnaires would contain missing values, or comments that the respondent did not know, or refused to answer a question.
These responses are coded in the data files with the following values:
VALUE   MEANING
-1      The data was not available on the questionnaire or form
-2      The field is not applicable
-3      Respondent refused to answer
-4      Respondent did not know answer to question
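When analysing the files, these codes need to be set aside before computing any statistics, otherwise they are treated as real negative numbers. The following Python sketch shows one way to do that. The function names and the sample values are hypothetical, and the approach assumes the variable cannot legitimately take the values -1 to -4, which should be checked for any variable that can be negative.

```python
# Missing-value codes as listed in the documentation.
MISSING_CODES = {
    -1: "not available on the questionnaire or form",
    -2: "not applicable",
    -3: "refused to answer",
    -4: "did not know",
}


def split_value(raw):
    """Return (value, reason). value is None when raw is a missing code."""
    if raw in MISSING_CODES:
        return None, MISSING_CODES[raw]
    return raw, None


def mean_ignoring_missing(values):
    """Mean of the valid entries, or None if every entry is a missing code."""
    valid = [v for v in values if v not in MISSING_CODES]
    return sum(valid) / len(valid) if valid else None


# Hypothetical responses for one numeric question.
responses = [1200, -3, 800, -1]
print(mean_ignoring_missing(responses))
print(split_value(-4))
```

Keeping the reason alongside the value makes it possible to distinguish, for example, refusals from questions that did not apply.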
The data collected in clusters 217 and 218 is considered highly unreliable and has therefore been removed from the data set. The data currently available on the web site has been revised to remove the data from these clusters. Researchers who have downloaded the data in the past should revise their data sets. For information on the data in those clusters, contact SALDRU http://www.saldru.uct.ac.za/.