Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example calculation of distributional consistency (DC) using the parameters from the example data in Table 1.
This collection contains individual-level and 1-percent national sample data from the 1960 Census of Population and Housing conducted by the Census Bureau. It consists of a representative sample of the records from the 1960 sample questionnaires. The data are stored in 30 separate files, containing in total over two million records, organized by state. Some files contain the sampled records of several states while other files contain all or part of the sample for a single state. There are two types of records stored in the data files: one for households and one for persons. Each household record is followed by a variable number of person records, one for each of the household members. Data items in this collection include the individual responses to the basic social, demographic, and economic questions asked of the population in the 1960 Census of Population and Housing. Data are provided on household characteristics and features such as the number of persons in household, number of rooms and bedrooms, and the availability of hot and cold piped water, flush toilet, bathtub or shower, sewage disposal, and plumbing facilities. Additional information is provided on tenure, gross rent, year the housing structure was built, and value and location of the structure, as well as the presence of air conditioners, radio, telephone, and television in the house, and ownership of an automobile. Other demographic variables provide information on age, sex, marital status, race, place of birth, nationality, education, occupation, employment status, income, and veteran status. The data files were obtained by ICPSR from the Center for Social Analysis, Columbia University. (Source: downloaded from ICPSR 7/13/10)
Please Note: This dataset is part of the historical CISER Data Archive Collection and is also available at ICPSR at https://doi.org/10.3886/ICPSR07756.v1. We highly recommend using the ICPSR version as they may make this dataset available in multiple data formats in the future.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
If you are working with these files, please cite them as follows: Windhager, J., Zanotelli, V.R.T., Schulz, D. et al. An end-to-end workflow for multiplexed image processing and analysis. Nat Protoc (2023). https://doi.org/10.1038/s41596-023-00881-0

This repository contains additional information related to the IMC example data available at zenodo.org/record/5949116. The following files are available and are part of the IMC Data Analysis workflow:
- gated_cells.zip: contains SpatialExperiment objects storing cells that were manually gated based on their expression values to derive ground truth cell phenotype labels.
- spe.rds: SpatialExperiment object containing the single-cell information (mean intensity per cell and per channel; cellular metadata; channel metadata) of the processed data.
- images.rds: CytoImageList object containing the spillover-corrected images.
- masks.rds: CytoImageList object containing the segmentation masks.
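For orientation, a minimal sketch of how these objects might be loaded in an R session, assuming the Bioconductor packages used by the workflow (SpatialExperiment, cytomapper) are installed; this snippet is not part of the published protocol.

```r
# Minimal loading sketch (not from the repository); assumes the files sit in the working directory
library(SpatialExperiment)  # class of spe.rds
library(cytomapper)         # CytoImageList class of images.rds and masks.rds

spe    <- readRDS("spe.rds")     # single-cell intensities and metadata
images <- readRDS("images.rds")  # spillover-corrected multi-channel images
masks  <- readRDS("masks.rds")   # segmentation masks

# Quick checks: cells x channels, and matching image/mask names
dim(spe)
all(names(images) == names(masks))
```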
As part of PCBS' efforts to provide official Palestinian statistics on the different aspects of life in Palestinian society, and because of the wide spread of computers, the Internet and mobile phones among the Palestinian people and the important role they may play in spreading knowledge and culture and in shaping public opinion, PCBS conducted the Household Survey on Information and Communications Technology, 2014.
The main objective of this survey is to provide statistical data on information and communication technology in Palestine, in particular on the following: - Prevalence of computers and access to the Internet. - Penetration of technology and the purposes for which it is used.
Palestine (West Bank and Gaza Strip), type of locality (urban, rural, refugee camps) and governorate.
All Palestinian households and individuals whose usual place of residence is in Palestine, with a focus on persons aged 10 years and over, in 2014.
Sample survey data [ssd]
Sampling Frame The sampling frame consists of a list of enumeration areas adopted in the Population, Housing and Establishments Census of 2007. Each enumeration area has an average size of about 124 households. These were used in the first phase as Preliminary Sampling Units in the process of selecting the survey sample.
Sample Size The total sample size of the survey was 7,268 households, of which 6,000 responded.
Sample Design The sample is a stratified clustered systematic random sample. The design comprised three phases:
Phase I: Random sample of 240 enumeration areas. Phase II: Selection of 25 households from each enumeration area selected in phase one using systematic random selection. Phase III: Selection of an individual (aged 10 years or over) in the field from the selected households; Kish tables were used to ensure random selection.
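As a rough, purely illustrative sketch of what phases II and III amount to (the survey itself used systematic selection sheets and pre-printed Kish tables in the field), the following R snippet draws 25 households systematically from a 124-household EA listing and then one member aged 10 years or over per selected household; all object names are hypothetical.

```r
# Illustrative only; not part of the survey documentation
set.seed(2014)

# Phase II: systematic random selection of 25 households from an EA listing of 124
ea_listing <- 1:124
step  <- length(ea_listing) / 25
start <- runif(1, 0, step)
selected_hh <- ea_listing[ceiling(seq(start, by = step, length.out = 25))]

# Phase III: one member aged 10+ selected at random per household (Kish-style pick)
roster <- data.frame(hh = rep(selected_hh, times = sample(2:6, 25, replace = TRUE)))
roster$age <- sample(1:80, nrow(roster), replace = TRUE)
eligible <- subset(roster, age >= 10)
respondents <- do.call(rbind, lapply(split(eligible, eligible$hh),
                                     function(h) h[sample(nrow(h), 1), ]))
head(respondents)
```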
Sample Strata Distribution of the sample was stratified by: 1- Governorate (16 governorates, J1). 2- Type of locality (urban, rural and camps).
Face-to-face [f2f]
The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.
Section II: Household data, covering possession of a computer, access to the Internet, and ownership of various media and computer equipment. This section also includes information on topics related to computer and Internet use, supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household at home.
Section III: Data on persons (aged 10 years and over) about computer use, access to the Internet and possession of a mobile phone.
Preparation of Data Entry Program: This stage included preparation of the data entry programs using an ACCESS package and defining data entry control rules to avoid errors, plus validation inquiries to examine the data after it had been captured electronically.
Data Entry: The data entry process started on the 8th of May 2014 and ended on the 23rd of June 2014. The data entry took place at the main PCBS office and in field offices using 28 data clerks.
Editing and Cleaning procedures: Several measures were taken to avoid non-sampling errors. These included editing of questionnaires before data entry to check for field errors, using a data entry application that rejects invalid entries during data entry, and then examining the data using frequency and cross tabulations to ensure the data were free of errors; anomalous values were cleaned and inspected to ensure consistency between the different questions on the questionnaire.
Response Rates: 79%
Data quality has many aspects, ranging from the initial planning of the survey to the dissemination of the results, and including how well users understand and use the data. There are three components to the quality of statistics: accuracy, comparability, and quality control procedures.
Checks on data accuracy cover many aspects of the survey and include statistical errors due to the use of a sample, non-statistical errors resulting from field workers or survey tools, and response rates and their effect on estimations. This section includes:
Statistical Errors: Data of this survey may be affected by statistical errors due to the use of a sample and not a complete enumeration. Therefore, certain differences can be expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators.
Variance calculations revealed that there is no problem in disseminating results nationally or regionally (the West Bank, Gaza Strip), but some indicators show high variance by governorate, as noted in the tables of the main report.
Non-Statistical Errors: Non-statistical errors are possible at all stages of the project, during data collection or processing. These are referred to as non-response errors, response errors, interviewing errors and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the field workers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, and both practical and theoretical training took place during the training course. Training manuals were provided for each section of the questionnaire, along with practical exercises in class and instructions on how to approach respondents to reduce refusals. Data entry staff were trained on the data entry program, which was tested before starting the data entry process.
The sources of non-statistical errors can be summarized as: 1. Some households were not at home and could not be interviewed, and some households refused to be interviewed. 2. In rare cases, errors occurred because of the way interviewers asked the questions or because respondents misunderstood some of the questions.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set contains the replication data and supplements for the article "Knowing, Doing, and Feeling: A three-year, mixed-methods study of undergraduates’ information literacy development." The survey data come from two samples:
- cross-sectional sample (different students at the same point in time)
- longitudinal sample (the same students at different points in time)

Surveys were distributed via Qualtrics during the students' first and sixth semesters. Quantitative and qualitative data were collected and used to describe students' IL development over 3 years. Statistics from the quantitative data were analyzed in SPSS. The qualitative data were coded and analyzed thematically in NVivo. The qualitative, textual data come from semi-structured interviews with sixth-semester students in psychology at UiT, both focus groups and individual interviews. All data were collected as part of the contact author's PhD research on information literacy (IL) at UiT.

The following files are included in this data set:
1. A README file which explains the quantitative data files. (2 file formats: .txt, .pdf)
2. The consent form for participants (in Norwegian). (2 file formats: .txt, .pdf)
3. Six data files with survey results from UiT psychology undergraduate students for the cross-sectional (n=209) and longitudinal (n=56) samples, in 3 formats (.dat, .csv, .sav). The data were collected in Qualtrics from fall 2019 to fall 2022.
4. Interview guide for 3 focus group interviews. File format: .txt
5. Interview guides for 7 individual interviews - first round (n=4) and second round (n=3). File format: .txt
6. The 21-item IL test (Tromsø Information Literacy Test = TILT), in English and Norwegian. TILT is used for assessing students' knowledge of three aspects of IL: evaluating sources, using sources, and seeking information. The test is multiple choice, with four alternative answers for each item. This test is a "KNOW-measure," intended to measure what students know about information literacy. (2 file formats: .txt, .pdf)
7. Survey questions related to interest - specifically students' interest in being or becoming information literate - in 3 parts (all in English and Norwegian): a) information and questions about the 4 phases of interest; b) interest questionnaire with 26 items in 7 subscales (Tromsø Interest Questionnaire - TRIQ); c) survey questions about IL and interest, need, and intent. (2 file formats: .txt, .pdf)
8. Information about the assignment-based measures used to measure what students do in practice when evaluating and using sources. Students were evaluated with these measures in their first and sixth semesters. (2 file formats: .txt, .pdf)
9. The Norwegian Centre for Research Data's (NSD) 2019 assessment of the notification form for personal data for the PhD research project. In Norwegian. (Format: .pdf)
https://spdx.org/licenses/CC0-1.0.html
Evolutionary ecologists increasingly study reaction norms that are expressed repeatedly within the same individual's lifetime. For example, foragers continuously alter anti-predator vigilance in response to moment-to-moment changes in predation risk. Variation in this form of plasticity occurs both among and within individuals. Among-individual variation in plasticity (individual by environment interaction or I×E) is commonly studied; by contrast, despite increasing interest in its evolution and ecology, within-individual variation in phenotypic plasticity is not. We outline a study design based on repeated measures and a multi-level extension of random regression models that enables quantification of variation in reaction norms at different hierarchical levels (such as among- and within-individuals). The approach enables the calculation of repeatability of reaction norm intercepts (average phenotype) and slopes (level of phenotypic plasticity); these indices are not specific to measurement or scaling and are readily comparable across data sets. The proposed study design also enables calculation of repeatability at different temporal scales (such as short- and long-term repeatability) thereby answering calls for the development of approaches enabling scale-dependent repeatability calculations. We introduce a simulation package in the R statistical language to assess power, imprecision and bias for multi-level random regression that may be utilised for realistic datasets (unequal sample sizes across individuals, missing data, etc). We apply the idea to a worked example to illustrate its utility. We conclude that consideration of multi-level variation in reaction norms deepens our understanding of the hierarchical structuring of labile characters and helps reveal the biology in heterogeneous patterns of within-individual variance that would otherwise remain ‘unexplained’ residual variance.
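A minimal sketch of the kind of multi-level random regression described above, fitted with lme4 on simulated data; the variable names (vigilance, risk, id, series) are illustrative and the authors' own simulation package is not reproduced here.

```r
# Illustrative multi-level random regression; not the paper's own code
library(lme4)
set.seed(1)

# 40 individuals, each measured in 4 repeated series of 10 observations
dat <- expand.grid(obs = 1:10, series = factor(1:4), id = factor(1:40))
dat$risk  <- runif(nrow(dat))                          # environmental gradient
slope_id  <- rnorm(40, 0, 0.5)[dat$id]                 # among-individual slope deviations
slope_ser <- rnorm(160, 0, 0.3)[as.integer(interaction(dat$id, dat$series))]  # within-individual deviations
dat$vigilance <- 2 + (1 + slope_id + slope_ser) * dat$risk + rnorm(nrow(dat), 0, 0.2)

# Random intercepts and slopes among individuals (id) and within individuals across series (id:series)
m <- lmer(vigilance ~ risk + (risk | id) + (risk | id:series), data = dat)
VarCorr(m)  # slope repeatability = among-individual slope variance / total slope variance
```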
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other one with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
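The following R sketch illustrates the two-stage logic described above on a tiny made-up frame; it is not the distributed sampling script, and column names follow the description where possible (geo_1, urban) and are otherwise hypothetical.

```r
# Two-stage stratified sample sketch on a toy frame (not the distributed script)
set.seed(123)
n_hh_per_ea <- 25            # fixed take per enumeration area
# (in the real design, 8000 / 25 = 320 EAs were selected in total)

# Toy household frame: 2 provinces x urban/rural, 40 EAs of 60 households each
hhframe <- expand.grid(hh = 1:60, ea_id = 1:40)
hhframe$geo_1   <- ifelse(hhframe$ea_id <= 20, "P1", "P2")
hhframe$urban   <- ifelse(hhframe$ea_id %% 2 == 0, "urban", "rural")
hhframe$stratum <- interaction(hhframe$geo_1, hhframe$urban)

# Stage 1: allocate EAs to strata proportionally to stratum size (8 EAs in this toy case)
alloc <- round(8 * prop.table(table(hhframe$stratum)))
sampled_eas <- unlist(lapply(names(alloc), function(s)
  sample(unique(hhframe$ea_id[hhframe$stratum == s]), alloc[[s]])))

# Stage 2: 25 households selected at random within each sampled EA
sampled_hh <- do.call(rbind, lapply(sampled_eas, function(ea) {
  ea_hh <- hhframe[hhframe$ea_id == ea, ]
  ea_hh[sample(nrow(ea_hh), n_hh_per_ea), ]
}))
nrow(sampled_hh)   # 8 EAs x 25 households = 200 in the toy frame
```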
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks based on which synthetic observations were assessed and rejected or replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.
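As an illustration of what such a validator might look like (the actual checks used during generation are not reproduced here), a small R sketch with hypothetical column names:

```r
# Hypothetical consistency checks applied to synthetic rows
validators <- list(
  age_in_range      = function(df) df$age >= 0 & df$age <= 110,
  child_not_married = function(df) !(df$age < 12 & df$marital_status == "married")
)

validate <- function(df) {
  ok <- Reduce(`&`, lapply(validators, function(v) v(df)))
  df[ok, ]   # rejected rows would be regenerated by the synthesizer
}

synth <- data.frame(age = c(25, -3, 8),
                    marital_status = c("married", "single", "married"))
validate(synth)   # keeps only the first row
```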
This is a synthetic dataset; the "response rate" is 100%.
The 1981 Census Microdata Individual File for Great Britain: 5% Sample dataset was created from existing digital records from the 1981 Census under a project known as Enhancing and Enriching Historic Census Microdata Samples (EEHCM), which was funded by the Economic and Social Research Council with input from the Office for National Statistics and National Records of Scotland. The project ran from 2012-2014 and was led from the UK Data Archive, University of Essex, in collaboration with the Cathie Marsh Institute for Social Research (CMIST) at the University of Manchester and the Census Offices. In addition to the 1981 data, the team worked on files from the 1961 Census and 1971 Census.
The original 1981 records preceded current data archival standards and were created before microdata sets for secondary use were anticipated. A process of data recovery and quality checking was necessary to maximise their utility for current researchers, though some imperfections remain (see the User Guide for details). Three other 1981 Census datasets have been created:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Repeated measures correlation (rmcorr) is a statistical technique for determining the common within-individual association for paired measures assessed on two or more occasions for multiple individuals. Simple regression/correlation is often applied to non-independent observations or aggregated data; this may produce biased, specious results due to violation of independence and/or differing patterns between-participants versus within-participants. Unlike simple regression/correlation, rmcorr does not violate the assumption of independence of observations. Also, rmcorr tends to have much greater statistical power because neither averaging nor aggregation is necessary for an intra-individual research question. Rmcorr estimates the common regression slope, the association shared among individuals. To make rmcorr accessible, we provide background information for its assumptions and equations, visualization, power, and tradeoffs with rmcorr compared to multilevel modeling. We introduce the R package (rmcorr) and demonstrate its use for inferential statistics and visualization with two example datasets. The examples are used to illustrate research questions at different levels of analysis, intra-individual, and inter-individual. Rmcorr is well-suited for research questions regarding the common linear association in paired repeated measures data. All results are fully reproducible.
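A minimal usage sketch for the rmcorr package (not one of the paper's own worked examples); the simulated data and column names are illustrative.

```r
# Illustrative rmcorr usage on simulated paired repeated measures
library(rmcorr)
set.seed(42)

# 10 participants, 5 paired measurements each, sharing a common within-person association
d <- data.frame(id = factor(rep(1:10, each = 5)))
d$x <- rnorm(50) + as.integer(d$id)                        # participant-specific offsets in x
d$y <- 0.6 * d$x + rnorm(10)[d$id] + rnorm(50, sd = 0.5)   # common slope plus person-level shifts

rc <- rmcorr(participant = id, measure1 = x, measure2 = y, dataset = d)
rc   # common within-individual correlation, degrees of freedom, p-value, confidence interval
# the package also provides a plot method for visualizing per-participant fits with the shared slope
```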
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: This paper contributes to the existing literature by reviewing the research methodology and the literature, with a focus on potential applications of the novel technology of a single-platform e-payment system. The topics reviewed include, but are not restricted to, the subjects, population, sample size requirements, data collection methods and measurement of variables, pilot study, and statistical techniques for data analysis. The review is intended to help future researchers, students and others conceptualize, operationalize and analyze the underlying research methodology and to assist in the development of their own research methodology.
These data contain the results of GC-MS, LC-MS and immunochemistry analyses of mask sample extracts. The data include tentatively identified compounds through library searches and compound abundance. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: the data cannot be accessed. Format: The dataset contains the identification of compounds found in the mask samples as well as the abundance of those compounds for individuals who participated in the trial. This dataset is associated with the following publication: Pleil, J., M. Wallace, J. McCord, M. Madden, J. Sobus, and G. Ferguson. How do cancer-sniffing dogs sort biological samples? Exploring case-control samples with non-targeted LC-Orbitrap, GC-MS, and immunochemistry methods. Journal of Breath Research. Institute of Physics Publishing, Bristol, UK, 14(1): 016006, (2019).
By the middle of the 1990s, Indonesia had enjoyed over three decades of remarkable social, economic, and demographic change and was on the cusp of joining the middle-income countries. Per capita income had risen more than fifteenfold since the early 1960s, from around US$50 to more than US$800. Increases in educational attainment and decreases in fertility and infant mortality over the same period reflected impressive investments in infrastructure.
In the late 1990s the economic outlook began to change as Indonesia was gripped by the economic crisis that affected much of Asia. In 1998 the rupiah collapsed, the economy went into a tailspin, and gross domestic product contracted by an estimated 12-15%-a decline rivaling the magnitude of the Great Depression.
The general trend of several decades of economic progress followed by a few years of economic downturn masks considerable variation across the archipelago in the degree both of economic development and of economic setbacks related to the crisis. In part this heterogeneity reflects the great cultural and ethnic diversity of Indonesia, which in turn makes it a rich laboratory for research on a number of individual- and household-level behaviors and outcomes that interest social scientists.
The Indonesia Family Life Survey is designed to provide data for studying behaviors and outcomes. The survey contains a wealth of information collected at the individual and household levels, including multiple indicators of economic and non-economic well-being: consumption, income, assets, education, migration, labor market outcomes, marriage, fertility, contraceptive use, health status, use of health care and health insurance, relationships among co-resident and non-resident family members, processes underlying household decision-making, transfers among family members and participation in community activities. In addition to individual- and household-level information, the IFLS provides detailed information from the communities in which IFLS households are located and from the facilities that serve residents of those communities. These data cover aspects of the physical and social environment, infrastructure, employment opportunities, food prices, access to health and educational facilities, and the quality and prices of services available at those facilities. By linking data from IFLS households to data from their communities, users can address many important questions regarding the impact of policies on the lives of the respondents, as well as document the effects of social, economic, and environmental change on the population.
The Indonesia Family Life Survey complements and extends the existing survey data available for Indonesia, and for developing countries in general, in a number of ways.
First, relatively few large-scale longitudinal surveys are available for developing countries. IFLS is the only large-scale longitudinal survey available for Indonesia. Because data are available for the same individuals from multiple points in time, IFLS affords an opportunity to understand the dynamics of behavior, at the individual, household and family and community levels. In IFLS1 7,224 households were interviewed, and detailed individual-level data were collected from over 22,000 individuals. In IFLS2, 94.4% of IFLS1 households were re-contacted (interviewed or died). In IFLS3 the re-contact rate was 95.3% of IFLS1 households. Indeed nearly 91% of IFLS1 households are complete panel households in that they were interviewed in all three waves, IFLS1, 2 and 3. These re-contact rates are as high as or higher than most longitudinal surveys in the United States and Europe. High re-interview rates were obtained in part because we were committed to tracking and interviewing individuals who had moved or split off from the origin IFLS1 households. High re-interview rates contribute significantly to data quality in a longitudinal survey because they lessen the risk of bias due to nonrandom attrition in studies using the data.
Second, the multipurpose nature of IFLS instruments means that the data support analyses of interrelated issues not possible with single-purpose surveys. For example, the availability of data on household consumption together with detailed individual data on labor market outcomes, health outcomes and on health program availability and quality at the community level means that one can examine the impact of income on health outcomes, but also whether health in turn affects incomes.
Third, IFLS collected both current and retrospective information on most topics. With data from multiple points of time on current status and an extensive array of retrospective information about the lives of respondents, analysts can relate dynamics to events that occurred in the past. For example, changes in labor outcomes in recent years can be explored as a function of earlier decisions about schooling and work.
Fourth, IFLS collected extensive measures of health status, including self-reported measures of general health status, morbidity experience, and physical assessments conducted by a nurse (height, weight, head circumference, blood pressure, pulse, waist and hip circumference, hemoglobin level, lung capacity, and time required to repeatedly rise from a sitting position). These data provide a much richer picture of health status than is typically available in household surveys. For example, the data can be used to explore relationships between socioeconomic status and an array of health outcomes.
Fifth, in all waves of the survey, detailed data were collected about respondents' communities and public and private facilities available for their health care and schooling. The facility data can be combined with household and individual data to examine the relationship between, for example, access to health services (or changes in access) and various aspects of health care use and health status.
Sixth, because the waves of IFLS span the period from several years before the economic crisis hit Indonesia, to just prior to it hitting, to one year and then three years after, extensive research can be carried out regarding the living conditions of Indonesian households during this very tumultuous period. In sum, the breadth and depth of the longitudinal information on individuals, households, communities, and facilities make IFLS data a unique resource for scholars and policymakers interested in the processes of economic development.
National coverage
Sample survey data [ssd]
Because it is a longitudinal survey, IFLS3 drew its sample from IFLS1, IFLS2, and IFLS2+. The IFLS1 sampling scheme stratified on provinces and urban/rural location, then randomly sampled within these strata (see Frankenberg and Karoly, 1995, for a detailed description). Provinces were selected to maximize representation of the population, capture the cultural and socioeconomic diversity of Indonesia, and be cost-effective to survey given the size and terrain of the country. Mainly for cost-effectiveness reasons, 14 of the then-existing 27 provinces were excluded. The resulting sample included 13 of Indonesia's 27 provinces containing 83% of the population: four provinces on Sumatra (North Sumatra, West Sumatra, South Sumatra, and Lampung), all five of the Javanese provinces (DKI Jakarta, West Java, Central Java, DI Yogyakarta, and East Java), and four provinces covering the remaining major island groups (Bali, West Nusa Tenggara, South Kalimantan, and South Sulawesi).
Household Survey:
Within each of the 13 provinces, enumeration areas (EAs) were randomly chosen from a nationally representative sample frame used in the 1993 SUSENAS, a socioeconomic survey of about 60,000 households. The IFLS randomly selected 321 enumeration areas in the 13 provinces, over-sampling urban EAs and EAs in smaller provinces to facilitate urban-rural and Javanese-non-Javanese comparisons.
Within a selected EA, households were randomly selected based upon 1993 SUSENAS listings obtained from the regional BPS office. A household was defined as a group of people whose members reside in the same dwelling and share food from the same cooking pot (the standard BPS definition). Twenty households were selected from each urban EA, and 30 households were selected from each rural EA. This strategy minimized expensive travel between rural EAs while balancing the loss of precision from correlations among households within the same EA. For IFLS1 a total of 7,730 households were sampled to obtain a final sample size goal of 7,000 completed households. This strategy was based on BPS experience of about 90% completion rates. In fact, IFLS1 exceeded that target and interviews were conducted with 7,224 households in late 1993 and early 1994.
IFLS3 Re-Contact Protocols: The sampling approach in IFLS3 was to re-contact all original IFLS1 households that had living members the last time they had been contacted, plus split-off households from both IFLS2 and IFLS2+, the so-called target households (8,347 households, as shown in Table 2.1*). Main fieldwork for IFLS3 ran from June through November 2000. A total of 10,574 households were contacted in 2000, meaning that they were interviewed, had all members die since the last contact, or had joined another IFLS household which had been previously interviewed (Table 2.1*). Of these, 7,928 were IFLS3 target households and 2,646 were new split-off households. A re-contact rate of 95.0% was thus achieved for all IFLS3 "target" households. The re-contacted households included 6,800 original 1993 households, or 95.3% of those. Among IFLS1 households, somewhat lower re-contact rates were achieved in Jakarta, 84.5%, and North Sumatra,
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Methods: The objective of this project was to determine whether a federated analysis approach using DataSHIELD can match the results of a classical centralized analysis in a real-world setting. The research was carried out on an anonymous synthetic longitudinal real-world oncology cohort randomly split into three local databases, mimicking three healthcare organizations, stored in a federated data platform integrating DataSHIELD. No individual data were transferred; statistics were calculated in parallel within each healthcare organization and only summary statistics (aggregates) were returned to the federated data analyst. Descriptive statistics, survival analysis, regression models and correlation were first performed with the centralized approach and then reproduced with the federated approach, and the results of the two approaches were compared.

Results: The cohort was split into three samples (N1 = 157 patients, N2 = 94 and N3 = 64), and 11 derived variables and four types of analyses were generated. All analyses were successfully reproduced using DataSHIELD, except for one descriptive variable due to a data disclosure limitation in the federated environment, showing the good capability of DataSHIELD. For descriptive statistics, exactly equivalent results were found for the federated and centralized approaches, except for some differences in position measures. Estimates of univariate regression models were similar, with a loss of accuracy observed for multivariate models due to source database variability.

Conclusion: Our project showed a practical implementation and use case of a real-world federated approach using DataSHIELD. The capability and accuracy of common data manipulation and analysis were satisfactory, and the flexibility of the tool enabled the production of a variety of analyses while preserving the privacy of individual data. The DataSHIELD forum was also a practical source of information and support. To find the right balance between privacy and accuracy of the analysis, privacy requirements should be established before the start of the analysis, together with a data quality review of the participating healthcare organizations.
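As a rough illustration of the kind of client-side calls such a set-up involves (not the study's own scripts), the sketch below uses the DSI, DSOpal and dsBaseClient R packages; server URLs, table names, credentials and variable names are placeholders.

```r
# Client-side DataSHIELD sketch; connection details are placeholders
library(DSI)
library(DSOpal)
library(dsBaseClient)

builder <- DSI::newDSLoginBuilder()
builder$append(server = "org1", url = "https://opal.org1.example", user = "dsuser",
               password = "****", table = "project.cohort", driver = "OpalDriver")
builder$append(server = "org2", url = "https://opal.org2.example", user = "dsuser",
               password = "****", table = "project.cohort", driver = "OpalDriver")
builder$append(server = "org3", url = "https://opal.org3.example", user = "dsuser",
               password = "****", table = "project.cohort", driver = "OpalDriver")

# Individual records stay on each server; only aggregates come back to the analyst
conns <- DSI::datashield.login(logins = builder$build(), assign = TRUE, symbol = "D")
ds.mean("D$age", datasources = conns)                         # descriptive statistic
ds.glm(formula = "D$event ~ D$age + D$sex", family = "binomial",
       datasources = conns)                                    # pooled regression
DSI::datashield.logout(conns)
```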
Topics covered in the 2021 UK Census included:
The 2021 Census: Safeguarded Individual Microdata Sample at Grouped Local Authority Level dataset consists of a random sample of 5% of person records from the 2021 Census. It includes records for 3,021,611 persons. These data cover England and Wales only. The lowest level of geography is grouped local authority. This means groups of local authorities or single local authorities where the population reaches at least 120,000 persons. The dataset contains 87 variables and a low level of detail.
Census Microdata
Microdata are small samples of individual records from a single census from which identifying information has been removed. They contain a range of individual and household characteristics and can be used to carry out analysis not possible from standard census outputs, such as:
The microdata samples are designed to protect the confidentiality of individuals and households. This is done by applying access controls and removing information that might directly identify a person, such as names, addresses and date of birth. Record swapping is applied to the census data used to create the microdata samples. This is a statistical disclosure control (SDC) method, which makes very small changes to the data to prevent the identification of individuals. The microdata samples use further SDC methods, such as collapsing variables and restricting detail. The samples also include records that have been edited to prevent inconsistent data and contain imputed persons, households, and data values. To protect confidentiality, imputation flags are not included in any 2021 Census microdata sample.
https://www.usa.gov/government-works
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.
Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.
This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, and presence of any underlying medical conditions and risk behaviors; it contains no geographic data.
The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.
For more information:
NNDSS Supports the COVID-19 Response | CDC.
The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.
All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.
To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.
CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
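A small R sketch of the suppression rule described above (illustrative only, not CDC's production code); column names are hypothetical.

```r
# Re-code rare demographic combinations (< 5 records) to NA instead of dropping the record
suppress_rare <- function(df, cols, threshold = 5) {
  key  <- do.call(paste, df[cols])
  rare <- key %in% names(which(table(key) < threshold))
  df[rare, cols] <- NA      # re-coded to the NA answer option; record is kept
  df
}

cases <- data.frame(sex       = c(rep("F", 6), "M"),
                    age_group = c(rep("20-29", 6), "80+"),
                    race_eth  = c(rep("group A", 6), "group B"))
suppress_rare(cases, c("sex", "age_group", "race_eth"))
```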
For questions, please contact Ask SRRG (eocevent394@cdc.gov).
COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These
This dataset provides sample premium information for individual ACA-compliant health insurance plans available to Iowans for 2025 based on age, rating area and metal level. These are premiums for individuals, not families. Explore and drill into the data using the 2025 Sample Premium Explorer. Please note that not every plan ID is available in every county. On or after November 1, 2024, please go to www.healthcare.gov to determine if your plan is available in the county you reside in.
The UK censuses took place on 29th April 2001. They were run by the Northern Ireland Statistics & Research Agency (NISRA), General Register Office for Scotland (GROS), and the Office for National Statistics (ONS) for both England and Wales. The UK comprises the countries of England, Wales, Scotland and Northern Ireland.
Statistics from the UK censuses help paint a picture of the nation and how we live. They provide a detailed snapshot of the population and its characteristics, and underpin funding allocation to provide public services.
The 2001 Individual Licenced Sample of Anonymised Records for Imputation Analysis (I-SAR) is a 3% sample of individuals for all countries of the United Kingdom, with approximately 1.84 million records. The data are available for England, Wales, Scotland and Northern Ireland. Information is included for each individual on the main demographic, health, socio-economic and household variables. The 3% sample is an increase by comparison with 2% in 1991. Some variables have been broad-banded to reduce disclosure risk. The lowest level of geography is the Government Office Region (GOR), although Inner and Outer London are separately identified. This represents a significant reduction by comparison with 1991, where large Local Authorities (population 120,000 and over) were separately identified.

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the survey of income and program participation (sipp) with r

if the census bureau's budget was gutted and only one complex sample survey survived, pray it's the survey of income and program participation (sipp). it's giant. it's rich with variables. it's monthly. it follows households over three, four, now five year panels. the congressional budget office uses it for their health insurance simulation. analysts read that sipp has person-month files, get scurred, and retreat to inferior options. the american community survey may be the mount everest of survey data, but sipp is most certainly the amazon. questions swing wild and free through the jungle canopy i mean core data dictionary. legend has it that there are still species of topical module variables that scientists like you have yet to analyze. ponce de león would've loved it here. ponce. what a name. what a guy.

the sipp 2008 panel data started from a sample of 105,663 individuals in 42,030 households. once the sample gets drawn, the census bureau surveys one-fourth of the respondents every four months, over four or five years (panel durations vary). you absolutely must read and understand pdf pages 3, 4, and 5 of this document before starting any analysis (start at the header 'waves and rotation groups'). if you don't comprehend what's going on, try their survey design tutorial.

since sipp collects information from respondents regarding every month over the duration of the panel, you'll need to be hyper-aware of whether you want your results to be point-in-time, annualized, or specific to some other period. the analysis scripts below provide examples of each. at every four-month interview point, every respondent answers every core question for the previous four months. after that, wave-specific addenda (called topical modules) get asked, but generally only regarding a single prior month. to repeat: core wave files contain four records per person, topical modules contain one. if you stacked every core wave, you would have one record per person per month for the duration of the panel. mmmassive. ~100,000 respondents x 12 months x ~4 years. have an analysis plan before you start writing code so you extract exactly what you need, nothing more. better yet, modify something of mine. cool?

this new github repository contains eight, you read me, eight scripts:

1996 panel - download and create database.R
2001 panel - download and create database.R
2004 panel - download and create database.R
2008 panel - download and create database.R
- since some variables are character strings in one file and integers in another, initiate an r function to harmonize variable class inconsistencies in the sas importation scripts
- properly handle the parentheses seen in a few of the sas importation scripts
- because the SAScii package currently does not create an rsqlite database, initiate a variant of the read.SAScii function that imports ascii data directly into a sql database (.db)
- download each microdata file - weights, topical modules, everything - then read 'em into sql

2008 panel - full year analysis examples.R
- define which waves and specific variables to pull into ram, based on the year chosen
- loop through each of twelve months, constructing a single-year temporary table inside the database
- read that twelve-month file into working memory, then save it for faster loading later if you like
- read the main and replicate weights columns into working memory too, merge everything
- construct a few annualized and demographic columns using all twelve months' worth of information
- construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half, again save it for faster loading later, only if you're so inclined
- reproduce census-published statistics, not precisely (due to topcoding described here on pdf page 19)

2008 panel - point-in-time analysis examples.R
- define which wave(s) and specific variables to pull into ram, based on the calendar month chosen
- read that interview point (srefmon)- or calendar month (rhcalmn)-based file into working memory
- read the topical module and replicate weights files into working memory too, merge it like you mean it
- construct a few new, exciting variables using both core and topical module questions
- construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half
- reproduce census-published statistics, not exactly cuz the authors of this brief used the generalized variance formula (gvf) to calculate the margin of error - see pdf page 4 for more detail - the friendly statisticians at census recommend using the replicate weights whenever possible. oh hayy, now it is.

2008 panel - median value of household assets.R
- define which wave(s) and specific variables to pull into ram, based on the topical module chosen
- read the topical module and replicate weights files into working memory too, merge once again
- construct a replicate-weighted complex sample design with a...
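A hedged sketch of the replicate-weighted design step these scripts describe, using the survey package with a Fay's adjustment factor of one-half on a tiny stand-in data frame; the SIPP-style column names (wpfinwgt, repwgt1-repwgt40, thtotinc) are illustrative and should be checked against the actual files.

```r
# Replicate-weighted design sketch (stand-in data, not the repository's scripts)
library(survey)
set.seed(7)

# Tiny stand-in for the person-level file the scripts construct
n <- 200
sipp.df <- data.frame(wpfinwgt = runif(n, 500, 1500), thtotinc = rexp(n, 1 / 3000))
repw <- matrix(runif(n * 40, 500, 1500), n, 40,
               dimnames = list(NULL, paste0("repwgt", 1:40)))
sipp.df <- cbind(sipp.df, repw)

# Replicate-weighted complex sample design with a Fay adjustment factor of one-half
sipp.design <- svrepdesign(
  data = sipp.df, weights = ~wpfinwgt,
  repweights = "repwgt[0-9]+", type = "Fay", rho = 0.5,
  combined.weights = TRUE
)
svymean(~thtotinc, sipp.design)   # point estimate with a replicate-based standard error
```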
https://dataverse-staging.rdmc.unc.edu/api/datasets/:persistentId/versions/11.0/customlicense?persistentId=hdl:1902.29/11735
The Russia Longitudinal Monitoring Survey (RLMS) is a series of nationally representative surveys designed to monitor the effects of Russian reforms on the health and economic welfare of households and individuals in the Russian Federation. These effects are measured by a variety of means: detailed monitoring of individuals' health status and dietary intake, precise measurement of household-level expenditures and service utilization, and collection of relevant community-level data, including region-specific prices and community infrastructure data. Phase II data have been collected annually (with two exceptions) since 1994. The project has been run jointly by the Carolina Population Center at the University of North Carolina at Chapel Hill, headed by Barry M. Popkin, and the Demoscope team in Russia, headed by Polina Kozyreva and Mikhail Kosolapov.

Please note: The sample size in 2014 was cut by about 20%, because the cost of the project increased due to inflation, but financial support remained the same. The original 1994 sample remained the same, and all cuts applied only to the part of the sample which was added in 2010. It should be stated that the implemented procedure of cutting the sample size guarantees that the smaller sample is still representative at the national level. To lower the cost it was also decided to drop the Educational Expenses section from the HH questionnaire, which was added back in 2010.

Household Data: For the household interview, a single member of the household was asked questions that pertained to the entire family. The respondent was usually the oldest living woman in the home since she was available to be interviewed during the daytime. Any attempt to identify one person as the "household head" is as problematic in Russia as it is in the United States. Thus, the interviewer was instructed to speak with "the person who knows the most about this family's shopping and health."

Individual Data: In theory, the individual questionnaire is administered to every person living in the household. In practice, however, some individuals, such as very young children and elderly people, did not receive an individual interview. Individual-level information is the primary source of information pertaining to a person's health, employment status, demographic characteristics, and anthropometry. It can also be used to supplement household-level income and expenditure information. To safeguard the confidentiality of RLMS respondents, individual-level data sets omit text variables (designated char on questionnaires). Please note that almost all text variables exist in Russian only. English translations exist for only a few of these variables. Please contact us to check on the availability of English translations of specific variables of interest.
In 2024, the number of data compromises in the United States stood at 3,158 cases. Meanwhile, over 1.35 billion individuals were affected in the same year by data compromises, including data breaches, leakage, and exposure. While these are three different events, they have one thing in common: as a result of all three incidents, the sensitive data is accessed by an unauthorized threat actor.

Industries most vulnerable to data breaches: Some industry sectors usually see more significant cases of private data violations than others. This is determined by the type and volume of the personal information organizations of these sectors store. In 2024 the financial services, healthcare, and professional services were the three industry sectors that recorded most data breaches. Overall, the number of healthcare data breaches in some industry sectors in the United States has gradually increased within the past few years. However, some sectors saw a decrease.

Largest data exposures worldwide: In 2020, an adult streaming website, CAM4, experienced a leakage of nearly 11 billion records. This, by far, is the most extensive reported data leakage. This case, though, is unique because cyber security researchers found the vulnerability before the cyber criminals. The second-largest data breach is the Yahoo data breach, dating back to 2013. The company first reported about one billion exposed records, then later, in 2017, came up with an updated number of leaked records, which was three billion. In March 2018, the third biggest data breach happened, involving India’s national identification database Aadhaar. As a result of this incident, over 1.1 billion records were exposed.