The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for training and simulation purposes and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households, with a fixed number of 25 households to be selected from each enumeration area. In the first stage, the number of enumeration areas to be selected in each stratum was calculated, proportionally to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
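The R script distributed with the dataset is the authoritative implementation. As a rough illustration only, the two-stage design can be sketched in Python; the column names (`geo_1`, `urban`, `ea_id`) and the simple rounding of the proportional allocation are assumptions, not taken from the actual script.

```python
import pandas as pd

def draw_two_stage_sample(households, n_households=8000, per_ea=25, seed=1):
    """Two-stage stratified sample: allocate enumeration areas (EAs) to
    strata (geo_1 x urban/rural) proportionally to stratum size, then
    draw a fixed number of households within each selected EA."""
    n_eas_total = n_households // per_ea  # e.g. 8000 / 25 = 320 EAs
    # Stage 1: proportional allocation of EAs to strata (simple rounding;
    # a production script would repair rounding drift).
    ea_counts = households.groupby(["geo_1", "urban"])["ea_id"].nunique()
    alloc = (ea_counts / ea_counts.sum() * n_eas_total).round().astype(int)
    samples = []
    for (geo, urban), n_eas in alloc.items():
        stratum = households[(households["geo_1"] == geo)
                             & (households["urban"] == urban)]
        chosen_eas = (stratum["ea_id"].drop_duplicates()
                      .sample(n=n_eas, random_state=seed))
        # Stage 2: fixed take of households per selected EA.
        for ea in chosen_eas:
            samples.append(stratum[stratum["ea_id"] == ea]
                           .sample(n=per_ea, random_state=seed))
    return pd.concat(samples, ignore_index=True)
```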
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. However, a "fake" questionnaire was created for the sample dataset extracted from this dataset, for use as training material.
The synthetic data generation process included a set of "validators" (consistency checks against which synthetic observations were assessed and, when needed, rejected and replaced). Some post-processing was also applied to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
This collection comprises survey data gathered in 2024 as part of a project aimed at investigating how synthetic data can support secure data access and improve research workflows, particularly from the perspective of data-owning organisations.
The survey targeted data-owning organisations across the UK, including those in the government, academic, and health sectors. Respondents were individuals who could speak on behalf of their organisations, such as data managers, principal investigators, and information governance leads.
The motivation for this collection stemmed from the growing interest in synthetic data as a tool to enhance access to sensitive data and reduce pressure on Trusted Research Environments (TREs). The study explored organisational engagement with two types of synthetic data: synthetic data generated from real data, and “data-free” synthetic data created using metadata only.
The aims of the survey were to assess current practices, explore motivations and barriers to adoption, understand cost and governance models, and gather perspectives on scaling and outsourcing synthetic data production. Conditional logic was used to tailor the survey to organisations actively producing, planning, or not engaging with synthetic data.
This collection includes responses from 15 UK-based organisations. The survey covered eight core topics: organisational background, production practices, anticipated and realised benefits, technical and financial challenges, cost structures, data sharing models, scalability, and openness to external synthetic data generation.
The data offers exploratory insights into how UK organisations are approaching synthetic data in practice and can inform future research, infrastructure development, and policy guidance in this evolving area.
The findings have informed recommendations to support the responsible and efficient scaling of synthetic data production across sectors.
https://www.marketresearchforecast.com/privacy-policy
The Synthetic Data Generation Market size was valued at USD 288.5 million in 2023 and is projected to reach USD 1,920.28 million by 2032, exhibiting a CAGR of 31.1% during the forecast period. Synthetic data generation is the creation of artificial datasets that resemble real datasets in their data distributions and patterns: data points are produced using algorithms or models instead of being gathered through observations or surveys. One of its core advantages is that it can maintain the statistical characteristics of the original data while removing the privacy risk of using real data. Further, there is no limit to how much synthetic data can be created, so it can be used for extensive testing and training of machine learning models, unlike conventional data, which may be highly regulated or limited in availability. It also helps in generating comprehensive datasets that include many examples of specific situations or contexts that may occur in practice, improving an AI system's performance. The use of SDG significantly shortens the development cycle, requiring less time and effort for data collection and annotation, and allows researchers and developers to work more efficiently in specific domains such as healthcare and finance. Key drivers for this market are: Growing Demand for Data Privacy and Security to Fuel Market Growth. Potential restraints include: Lack of Data Accuracy and Realism Hinders Market Growth. Notable trends are: Growing Implementation of Touch-based and Voice-based Infotainment Systems to Increase Adoption of Intelligent Cars.
Abstract copyright UK Data Service and data collection copyright owner. The Annual Survey of Hours and Earnings, 2020: Synthetic Data Pilot is a synthetic version of the Annual Survey of Hours and Earnings (ASHE) study available via Trusted Research Environments (TREs). ASHE is one of the most extensive surveys of the earnings of individuals in the UK. Data on the wages, paid hours of work, and pensions arrangements of nearly one per cent of the working population are collected. Other variables relating to age, occupation and industrial classification are also available. The ASHE sample is drawn from National Insurance records for working individuals, and the survey forms are sent to their respective employers to complete. ASHE is available for research projects demonstrating public good to accredited or approved researchers via TREs such as the Office for National Statistics Secure Research Service (SRS) or the UK Data Service Secure Lab (at SN 6689). To access collections stored within TREs, researchers need to undergo an accreditation process. Gaining access to data in a secure environment can be time and resource intensive. This pilot has created a low fidelity, low disclosure risk synthetic version of ASHE data, which can be made available to researchers more quickly while they wait for access to the real data.The synthetic data were created using the Synthpop package in R. The sample method was used; this takes a simple random sample with replacement from the real values. The project was carried out in the period between 19th December 2022 and 3rd January 2023. Further information is available within the documentation. User feedback received through this pilot will help the ONS to maximise benefits of data access and further explore the feasibility of synthesising more data in future. 
Main Topics: The ASHE synthetic data contain the same variables as ASHE for each individual, relating to wages, hours of work, pension arrangements, and occupation and industrial classifications. There are also variables for age, gender and full/part-time status. Because ASHE data are collected by the employer, there are also variables relating to the organisation employing the individual. These include employment size and legal status (e.g. public company). Various geography variables are included in the data files. The year variable in this synthetic dataset is 2020. Simple random sample Compilation/Synthesis
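The pilot itself used the synthpop package in R. As a rough illustration only, the column-wise resampling idea behind the "sample" method can be sketched in Python; the variable names are hypothetical and the sketch is not the pilot's actual code.

```python
import numpy as np
import pandas as pd

def sample_method(real_df, seed=0):
    """Column-wise simple random sampling with replacement, in the spirit
    of synthpop's 'sample' method: each synthetic column is drawn
    independently from the real values, so marginal distributions are
    approximately preserved while cross-variable relationships (and much
    of the disclosure risk) are broken."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {col: rng.choice(real_df[col].to_numpy(), size=len(real_df), replace=True)
         for col in real_df.columns}
    )
```

Because each column is resampled independently, any analysis of relationships between variables (e.g. pay against hours) on such data is meaningless; its value is in letting researchers develop and test code against realistic formats and marginals.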
Large Language Models (LLMs) offer new research possibilities for social scientists, but their potential as “synthetic data" is still largely unknown. In this paper, we investigate how accurately the popular LLM ChatGPT can recover public opinion, prompting the LLM to adopt different “personas” and then provide feeling thermometer scores for 11 sociopolitical groups. The average scores generated by ChatGPT correspond closely to the averages in our baseline survey, the 2016–2020 American National Election Study. Nevertheless, sampling by ChatGPT is not reliable for statistical inference: there is less variation in responses than in the real surveys, and regression coefficients often differ significantly from equivalent estimates obtained using ANES data. We also document how the distribution of synthetic responses varies with minor changes in prompt wording, and we show how the same prompt yields significantly different results over a three-month period. Altogether, our findings raise serious concerns about the quality, reliability, and reproducibility of synthetic survey data generated by LLMs.
Please note: This is a Synthetic data file, also known as a Dummy File - it is NOT real data. This synthetic data file should not be used for purposes other than to develop and test computer programs that are to be submitted by remote access. Each record in the synthetic file matches the format and content parameters of the real Statistics Canada Master File with which it is associated, but the data themselves have been 'made up'. They do NOT represent responses from real individuals and should NOT be used for actual analysis. These data are provided solely for the purpose of testing statistical package 'code' (e.g. SPSS syntax, SAS programs, etc.) in preparation for analysis using the associated Master File in a Research Data Centre, by Remote Job Submission, or by some other means of secure access. If statistical analysis 'code' works with the synthetic data, researchers can have some confidence that the same code will run successfully against the Master File data in the Research Data Centres. The Canadian Community Health Survey (CCHS) is a cross-sectional survey that collects information related to health status, health care utilization and health determinants for the Canadian population. Starting in 2007, the CCHS now operates using continuous collection. It is a large sample, general population health survey, designed to provide reliable estimates at the health region level. In order to provide researchers with a means to access the master file(s), a remote access facility has been implemented. Remote access provides researchers with the possibility to submit computer programs via e-mail to a dedicated address (cchs-escc@statcan.ca), and to receive the results by return e-mail. To obtain remote access privileges, it is necessary that researchers obtain advance approval from the Health Statistics Division.
Requests must be submitted to the aforementioned e-mail address and must provide the following, clearly itemized information:
- the researcher's affiliation,
- the name of all researchers involved in the project,
- the title of the research project,
- an abstract of the project,
- the goals of the research,
- the data to which access is required (survey, cycle),
- why the project requires access to the master data rather than the PUMF,
- why the Remote Access service is chosen rather than on-site access in a Research Data Centre (RDC),
- the expected results, and
- the project's expected completion date.

Further information is available by contacting the CCHS team at the above e-mail address or by phone at (613) 951-1653. Once the request for remote access has been approved, the researcher can submit his/her computer programs to the CCHS team for processing on the master file(s). The computer output is reviewed by the team for confidentiality concerns and returned to the researcher. However, the correctness and accuracy of each program submission remains, at all times, the sole responsibility of the researcher.
The aim of this project was to create a synthetic dataset without using the original (secure, controlled) dataset to do so, and instead using only publicly available analytical output (i.e. output that was cleared for publication) to create the synthetic data. Such synthetic data may allow users to gain familiarity with and practise on data that is like the original before they gain access to the original data (where time in a secure setting may be limited).
The Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data was created without access to the original ASHE-2011 Census dataset (which is only available in a secure setting via the ONS Secure Research Service: "Annual Survey of Hours and Earnings linked to 2011 Census - England and Wales"). It was created as a teaching aid to support a training course "An Introduction to the linked ASHE-2011 Census dataset" organised by Administrative Data Research UK and the National Centre for Research Methods. The synthetic dataset contains a subset of the variables in the original dataset and was designed to reproduce the analytical output contained in the ASHE-Census 2011 Data Linkage User Guide.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Distance of GPT samples to the party-means. The distance of each synthetic sample to the corresponding party-mean is compared to the distance of each candidate to their respective party-mean. The mean and standard deviation of those distributions of distances are averaged across all questions for each party separately. The p-value corresponds to Welch’s t-test with the null hypothesis that GPT samples and candidates have equal distance to the party-mean.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Abstract copyright UK Data Service and data collection copyright owner. The aim of this project was to create a synthetic dataset without using the original (secure, controlled) dataset to do so, and instead using only publicly available analytical output (i.e. output that was cleared for publication) to create the synthetic data. Such synthetic data may allow users to gain familiarity with and practise on data that is like the original before they gain access to the original data (where time in a secure setting may be limited).The Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data was created without access to the original ASHE-2011 Census dataset (which is only available in a secure setting via the ONS Secure Research Service: "Annual Survey of Hours and Earnings linked to 2011 Census - England and Wales"). It was created as a teaching aid to support a training course "An Introduction to the linked ASHE-2011 Census dataset" organised by Administrative Data Research UK and the National Centre for Research Methods. The synthetic dataset contains a subset of the variables in the original dataset and was designed to reproduce the analytical output contained in the ASHE-Census 2011 Data Linkage User Guide. Main Topics: Variables available in this study relate to synthetic employment, earnings and demographic information for adults employed in England and Wales in 2011. Synthetic sample generated by a computer algorithm Compilation/Synthesis
The lack of a centralised and comprehensive register-based system in Great Britain limits opportunities for studying the interaction of aspects such as health, employment, benefit payments, or housing quality at the level of individuals and households. At the same time, the data that exist are typically strictly controlled and only available in safe haven environments under a 'create-and-destroy' model. In particular, when testing policy options via simulation models where results are required swiftly, these limitations can present major hurdles to coproduction and collaborative work connecting researchers, policymakers, and key stakeholders. In some cases, survey data can provide a suitable alternative to the lack of readily available administrative data. However, survey data typically does not allow for a small-area perspective. Although Special Licence area-level linkages of survey data can offer more detailed spatial information, the data coverage and statistical power might be too low for meaningful analysis.
As the SIPHER Synthetic Population is the outcome of a statistical creation process, all results obtained from this dataset should always be treated as 'model output', including basic descriptive statistics. In particular, the SIPHER Synthetic Population should not replace the underlying Understanding Society survey data for standard statistical analyses (e.g., standard regression analysis, or longitudinal multi-wave analysis). Please see the User Guide provided for this dataset for further information on creation and validation.
This research was conducted as part of the Systems Science in Public Health and Health Economics Research (SIPHER) Consortium (https://www.gla.ac.uk/research/az/sipher/) and we thank the whole team for valuable input and discussions which have informed this work.
The 2023 Popstan Synthetic Household Survey is a periodic national welfare monitoring survey conducted by the Popstan National Statistics Office. It is used to update the national poverty profile, based on the (indexed) national poverty line calculated in 2017.
National (all 10 regions)
Household, Individual
Resident population with exception of homeless, nomads, and residents in institutional households
Sample survey data [ssd]
A stratified sample was drawn, with the urban and rural areas of each province (geo1) used as strata. The sample of 8,000 households was allocated proportionally to the size of each geo1. In each stratum, enumeration areas (EAs) were randomly selected, and in each EA households were randomly selected.
The response rate was 100%.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The synthetic population was generated from the 2010-2014 ACS PUMS housing and person files.
United States Department of Commerce. Bureau of the Census. (2017-03-06).
American Community Survey 2010-2014 ACS 5-Year PUMS File [Data set].
Ann Arbor, MI: Inter-university Consortium of Political and Social
Research [distributor]. http://doi.org/10.3886/E100486V1
Outputs
There are 17 housing files
- repHus0.csv, repHus1.csv, ... repHus16.csv
and 32 person files
- rep_recode_ACSpus0.csv, rep_recode_ACSpus1.csv, ... rep_recode_ACSpus31.csv.
Files are split to be roughly equal in size. The files contain data for the entire country. Files are not split along any demographic characteristic. The person files and housing files must be concatenated to form a complete person file and a complete housing file, respectively.
Person and housing records can be merged on 'id'. Variable descriptions are below.
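The assembly described above can be sketched in pandas. The glob patterns follow the file names listed earlier; the `assemble` helper is a hypothetical convenience function, shown operating on already-loaded frames so the logic is easy to test.

```python
import pandas as pd

def assemble(person_frames, housing_frames):
    """Concatenate the split person and housing files, then attach each
    person record to its household record via 'id'."""
    persons = pd.concat(person_frames, ignore_index=True)
    housing = pd.concat(housing_frames, ignore_index=True)
    return persons.merge(housing, on="id", how="left", suffixes=("_p", "_h"))

# Typical usage, loading the distributed CSVs from disk:
# import glob
# persons = [pd.read_csv(f) for f in sorted(glob.glob("rep_recode_ACSpus*.csv"))]
# housing = [pd.read_csv(f) for f in sorted(glob.glob("repHus*.csv"))]
# full = assemble(persons, housing)
```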
Data Dictionary
See [2010-2014 ACS PUMS data dictionary](http://doi.org/10.3886/E100486V1). All variables from the ACS PUMS housing files are present in the synthetic housing files and all variables from the ACS PUMS person files are present in the synthetic person files. Variables have not been modified in any way. Theoretically, variables like `person weight` no longer have any use in the synthetic population.
See README.md for more details.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artificial intelligence (AI) and machine learning (ML) tools are now proliferating in biomedical contexts, and there is no sign this will slow down any time soon. AI/ML and related technologies promise to improve scientific understanding of health and disease and have the potential to spur the development of innovative and effective diagnostics, treatments, cures, and medical technologies. Concerns about AI/ML are prominent, but attention to two specific aspects of AI/ML have so far received little research attention: synthetic data and computational checklists that might promote not only the reproducibility of AI/ML tools but also increased attention to ethical, legal, and social implications (ELSI) of AI/ML tools. We administered a targeted survey to explore these two items among biomedical professionals in the United States. Our survey findings suggest that there is a gap in familiarity with both synthetic data and computational checklists among AI/ML users and developers and those in ethics-related positions who might be tasked with ensuring the proper use or oversight of AI/ML tools. The findings from this survey study underscore the need for additional ELSI research on synthetic data and computational checklists to inform escalating efforts, including the establishment of laws and policies, to ensure safe, effective, and ethical use of AI in health settings.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
France Consumer Survey: sa: Consumer Synthetic Index data was reported at 91.517 % Point in Mar 2025. This records a decrease from the previous number of 93.461 % Point for Feb 2025. France Consumer Survey: sa: Consumer Synthetic Index data is updated monthly, averaging 100.308 % Point from Jan 1987 (Median) to Mar 2025, with 459 observations. The data reached an all-time high of 126.236 % Point in Jan 2001 and a record low of 79.867 % Point in Jun 2013. France Consumer Survey: sa: Consumer Synthetic Index data remains active status in CEIC and is reported by National Institute of Statistics and Economic Studies. The data is categorized under Global Database’s France – Table FR.H032: Consumer Survey. [COVID-19-IMPACT]
This resource includes the necessary codes to generate a synthetic dataset of all crimes that occurred in each output area in England and Wales in 2011. Counts of violence, property crime and criminal damage can be generated, and three different approaches to counting crime are possible - synthetic data of all crimes, synthetic data of police recorded crimes, synthetic data of survey estimated crimes.
Having generated the crime counts at output area, they can be aggregated to any spatial scale of interest.
Crime counts are synthesised by predicting individual victimisation propensities using the Crime Survey for England and Wales (2011), then mapping these propensities onto individuals (and households) based on population counts from the UK census.
Data are synthetic. The following steps were followed to generate a synthetic dataset of crimes in England and Wales:

1. Download Census data aggregates at the Output Area level under an Open Government Licence.
2. Download microdata of the Crime Survey for England and Wales (CSEW) 2011/12 from the UK Data Service.
3. Generate a synthetic population of residents (or households) in Output Areas based on empirical parameters observed in Census data and the covariance matrix observed in the CSEW.
4. Based on parameters from the CSEW 2011/12, generate crimes (violence, property crime and damage) reported within each unit in the synthetic population.
5. Based on parameters from the CSEW 2011/12, predict whether each crime generated in Step 4 is known to, and recorded by, the police (this will be the synthetic dataset of police-recorded crimes).
6. Draw a random sample of units from the synthetic population following the sampling design of the CSEW (this will be the synthetic dataset of crimes recorded by the CSEW).

This generates three sets of synthetic crime data, which can then be compared at different spatial scales:

i) 'synthetic_population_crimes.RData': synthetic data of all crime, split into 7 files (generated in Step 4)
ii) 'synthetic_police_crimes.RData': synthetic data of police-recorded crime (generated in Step 5)
iii) 'synthetic_survey_crimes.RData': synthetic data of survey-recorded crime (generated in Step 6)
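Steps 4 and 5 can be illustrated with a toy sketch. The actual work draws its parameters from the CSEW 2011/12 and was implemented in R; the Poisson model for crime counts and the single recording probability below are simplifying assumptions for illustration only.

```python
import numpy as np

def generate_crimes(propensities, p_recorded, seed=1):
    """Step 4: draw a crime count for each synthetic unit from its
    victimisation propensity (modelled here as a Poisson rate).
    Step 5: flag each crime as known to and recorded by the police
    with probability p_recorded, giving the police-recorded subset."""
    rng = np.random.default_rng(seed)
    all_crimes = rng.poisson(propensities)                # all crimes
    police_crimes = rng.binomial(all_crimes, p_recorded)  # recorded crimes
    return all_crimes, police_crimes
```

By construction, the police-recorded count can never exceed the total count for any unit, mirroring the relationship between the all-crime and police-recorded synthetic files.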
Small area estimates of self assessed health and limiting long-term illness developed using multilevel small area estimation methodologies. Estimates are for varying combinations of England, Wales and Scotland and are at the middle layer super output area (or equivalent). Comparisons with 2011 census data are facilitated. Estimates were developed using the Health Surveys for England, Wales and Scotland, and using the Crime Survey for England and Wales. Publications and working papers to date are available from the Project website (see 'additional resources')
3x2pt synthetic data generated according to the IST:L recipe for CLOE v2.0
Cosmological and nuisance parameters and survey specifics as detailed here
Files in the ASCII .dat format are the ones read by CLOE
Files in the binary .npy format are for additional tests
The data set consists of time, depth, reflection coefficient synthetic, sonic velocity, density, and amplitude used to create synthetic seismogram for Water Treatment Plant RO, G-2945, (DZMW-1) in Broward County, Florida.
The deposit contains a dataset created for the paper, 'Many Models in R: A Tutorial'. ncds.Rds is an R-format synthetic dataset created with the synthpop package in R using data from the National Child Development Study (NCDS), a birth cohort of individuals born in a single week of March 1958 in Britain. It is a simplified, synthetic and imputed version of the NCDS, containing fourteen biomarkers collected at the age 46/47 biomedical sweep of the survey, four measures of cognitive ability from tests at ages 11 and 16, and three covariates: sex, father's social class and body mass index at age 11. The data were originally collected by interview and nurse assessment. The data are only intended to be used in the tutorial and must not be used for drawing statistical inferences. For information about the creation of the synthetic data, please see "Data sourcing, processing and preparation" and the user guide.