100+ datasets found
  1. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • nada-demo.ihsn.org
    • microdata.worldbank.org
    Updated Nov 1, 2024
    Cite
    Development Data Group, Data Analytics Unit (2024). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://nada-demo.ihsn.org/index.php/catalog/135
    Dataset updated
    Nov 1, 2024
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
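
The two-stage procedure described above can be sketched in Python. This is a minimal illustration, not the R script distributed with the dataset. The function names, the largest-remainder rounding, and the equal-probability selection of enumeration areas in the first stage are assumptions; the text only states that the allocation is proportional to stratum size and that 25 households are drawn per enumeration area.

```python
import random
from collections import defaultdict


def allocate_eas(stratum_sizes, n_eas):
    """Split n_eas across strata in proportion to stratum size.

    Rounding uses the largest-remainder rule (an assumption).
    """
    total = sum(stratum_sizes.values())
    quotas = {s: n_eas * size / total for s, size in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    leftover = n_eas - sum(alloc.values())
    # Give the remaining EAs to the strata with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


def draw_sample(frame, sample_size=8000, per_ea=25, seed=0):
    """frame: iterable of (household_id, stratum, ea_id) tuples.

    Assumes every stratum has enough EAs and every EA has at least
    per_ea households; random.sample raises ValueError otherwise.
    """
    rng = random.Random(seed)
    eas_by_stratum = defaultdict(lambda: defaultdict(list))
    for hh_id, stratum, ea_id in frame:
        eas_by_stratum[stratum][ea_id].append(hh_id)

    stratum_sizes = {
        s: sum(len(hhs) for hhs in eas.values())
        for s, eas in eas_by_stratum.items()
    }
    alloc = allocate_eas(stratum_sizes, sample_size // per_ea)

    sample = []
    for stratum, n_eas in alloc.items():
        eas = eas_by_stratum[stratum]
        # Stage 1: pick enumeration areas within the stratum.
        for ea_id in rng.sample(sorted(eas), n_eas):
            # Stage 2: pick a fixed number of households within the EA.
            sample.extend(rng.sample(eas[ea_id], per_ea))
    return sample
```

With the figures from the text, 8,000 households at 25 per enumeration area means 320 enumeration areas are allocated across the strata.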

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks against which synthetic observations were assessed and rejected or replaced when needed). Some post-processing was also applied to produce the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  2. Data from: Organisational Readiness and Perceptions of Synthetic Data...

    • beta.ukdataservice.ac.uk
    Updated 2025
    Cite
    UK Data Service (2025). Organisational Readiness and Perceptions of Synthetic Data Production and Dissemination in the UK: Survey Data, 2024 [Dataset]. http://doi.org/10.5255/ukda-sn-857756
    Dataset updated
    2025
    Dataset provided by
    DataCite (https://www.datacite.org/)
    UK Data Service (https://ukdataservice.ac.uk/)
    Area covered
    United Kingdom
    Description

    This collection comprises survey data gathered in 2024 as part of a project aimed at investigating how synthetic data can support secure data access and improve research workflows, particularly from the perspective of data-owning organisations.

    The survey targeted data-owning organisations across the UK, including those in government, academia and the health sector. Respondents were individuals who could speak on behalf of their organisations, such as data managers, principal investigators, and information governance leads.

    The motivation for this collection stemmed from the growing interest in synthetic data as a tool to enhance access to sensitive data and reduce pressure on Trusted Research Environments (TREs). The study explored organisational engagement with two types of synthetic data: synthetic data generated from real data, and “data-free” synthetic data created using metadata only.

    The aims of the survey were to assess current practices, explore motivations and barriers to adoption, understand cost and governance models, and gather perspectives on scaling and outsourcing synthetic data production. Conditional logic was used to tailor the survey to organisations actively producing, planning, or not engaging with synthetic data.

    This collection includes responses from 15 UK-based organisations. The survey covered eight core topics: organisational background, production practices, anticipated and realised benefits, technical and financial challenges, cost structures, data sharing models, scalability, and openness to external synthetic data generation.

    The data offers exploratory insights into how UK organisations are approaching synthetic data in practice and can inform future research, infrastructure development, and policy guidance in this evolving area.

    The findings have informed recommendations to support the responsible and efficient scaling of synthetic data production across sectors.

  3. Synthetic Data Generation Market Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Jul 7, 2025
    Cite
    Market Research Forecast (2025). Synthetic Data Generation Market Report [Dataset]. https://www.marketresearchforecast.com/reports/synthetic-data-generation-market-1834
    Available download formats: pdf, doc, ppt
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Synthetic Data Generation Market size was valued at USD 288.5 million in 2023 and is projected to reach USD 1920.28 million by 2032, exhibiting a CAGR of 31.1% during the forecast period. Synthetic data generation is the creation of artificial datasets that resemble real datasets in their distributions and patterns. The data points are produced by algorithms or models instead of observations or surveys. One of its core advantages is that it can maintain the statistical characteristics of the original data and remove the privacy risk of using real data. Further, there is no limit to how much synthetic data can be created, so it can be used for extensive testing and training of machine learning models, unlike conventional data, which may be highly regulated or limited in availability. It also helps generate comprehensive datasets that include many examples of specific situations or contexts that may occur in practice, which can improve an AI system’s performance. The use of synthetic data generation (SDG) shortens the development cycle, requiring less time and effort for data collection and annotation, and lets researchers and developers work more efficiently in domains such as healthcare and finance. Key drivers for this market are: Growing Demand for Data Privacy and Security to Fuel Market Growth. Potential restraints include: Lack of Data Accuracy and Realism Hinders Market Growth. Notable trends are: Growing Implementation of Touch-based and Voice-based Infotainment Systems to Increase Adoption of Intelligent Cars.
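
The growth figures quoted above can be checked with the standard compound annual growth rate formula. The sketch below is only an arithmetic illustration using the numbers from the description; the function names are hypothetical, and the listing does not say which base year the stated CAGR refers to.

```python
import math


def cagr(start, end, periods):
    """Compound annual growth rate between two values over `periods` years."""
    return (end / start) ** (1 / periods) - 1


def implied_periods(start, end, rate):
    """Number of compounding periods needed to grow from start to end at `rate`."""
    return math.log(end / start) / math.log(1 + rate)


start, end = 288.5, 1920.28                      # USD million, from the description
rate_2023_2032 = cagr(start, end, 2032 - 2023)   # nine compounding years
periods_at_stated_rate = implied_periods(start, end, 0.311)

print(f"CAGR over 2023-2032: {rate_2023_2032:.1%}")
print(f"Periods implied by a 31.1% CAGR: {periods_at_stated_rate:.2f}")
```

With these inputs, growth over the nine years from 2023 to 2032 works out to roughly 23% per year, while a 31.1% rate corresponds to about seven compounding periods, so the stated rate may be measured from a later base year.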

  4. Annual Survey of Hours and Earnings, 2020: Synthetic Data Pilot - Dataset -...

    • b2find.eudat.eu
    Updated Jan 16, 2024
    Cite
    (2024). Annual Survey of Hours and Earnings, 2020: Synthetic Data Pilot - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/32fc8aba-6526-57a7-b9b6-07cd76f5eb4b
    Dataset updated
    Jan 16, 2024
    Description

    Abstract copyright UK Data Service and data collection copyright owner. The Annual Survey of Hours and Earnings, 2020: Synthetic Data Pilot is a synthetic version of the Annual Survey of Hours and Earnings (ASHE) study available via Trusted Research Environments (TREs). ASHE is one of the most extensive surveys of the earnings of individuals in the UK. Data on the wages, paid hours of work, and pension arrangements of nearly one per cent of the working population are collected. Other variables relating to age, occupation and industrial classification are also available. The ASHE sample is drawn from National Insurance records for working individuals, and the survey forms are sent to their respective employers to complete. ASHE is available for research projects demonstrating public good to accredited or approved researchers via TREs such as the Office for National Statistics Secure Research Service (SRS) or the UK Data Service Secure Lab (at SN 6689). To access collections stored within TREs, researchers need to undergo an accreditation process. Gaining access to data in a secure environment can be time and resource intensive. This pilot has created a low fidelity, low disclosure risk synthetic version of ASHE data, which can be made available to researchers more quickly while they wait for access to the real data. The synthetic data were created using the Synthpop package in R. The sample method was used; this takes a simple random sample with replacement from the real values. The project was carried out in the period between 19th December 2022 and 3rd January 2023. Further information is available within the documentation. User feedback received through this pilot will help the ONS to maximise benefits of data access and further explore the feasibility of synthesising more data in future.
    Main Topics: The ASHE synthetic data contain the same variables as ASHE for each individual, relating to wages, hours of work, pension arrangements, and occupation and industrial classifications. There are also variables for age, gender and full/part-time status. Because ASHE data are collected by the employer, there are also variables relating to the organisation employing the individual. These include employment size and legal status (e.g. public company). Various geography variables are included in the data files. The year variable in this synthetic dataset is 2020. Simple random sample Compilation/Synthesis
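
The idea behind the sample method mentioned above, drawing values at random with replacement from the observed ones, can be sketched in a few lines of Python. This is not the Synthpop package or its API. The function name is hypothetical, and resampling each variable independently of the others is an assumption that would explain the "low fidelity" label but is not spelled out in the description.

```python
import random


def sample_synthesise(records, n=None, seed=0):
    """Build synthetic rows by resampling each column's observed values.

    records: list of dicts sharing the same keys (the "real" data).
    Each column is drawn independently with replacement (an assumption),
    so marginal distributions are roughly kept while relationships
    between columns are not.
    """
    rng = random.Random(seed)
    n = len(records) if n is None else n
    columns = list(records[0].keys())
    observed = {c: [r[c] for r in records] for c in columns}
    return [{c: rng.choice(observed[c]) for c in columns} for _ in range(n)]


# Made-up rows for illustration only; they are not ASHE values.
real = [
    {"age": 34, "hours": 37.5, "full_time": True},
    {"age": 51, "hours": 20.0, "full_time": False},
    {"age": 27, "hours": 40.0, "full_time": True},
]
synthetic = sample_synthesise(real, n=5)
```

Every synthetic value is one that occurs in the real column, which is why such a file has the same variables and plausible values but cannot support substantive analysis.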

  5. Replication Data for: Synthetic Replacements for Human Survey Data? The...

    • dataone.org
    Updated Sep 25, 2024
    Cite
    Bisbee, James (2024). Replication Data for: Synthetic Replacements for Human Survey Data? The Perils of Large Language Models [Dataset]. http://doi.org/10.7910/DVN/VPN481
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Bisbee, James
    Description

    Large Language Models (LLMs) offer new research possibilities for social scientists, but their potential as “synthetic data” is still largely unknown. In this paper, we investigate how accurately the popular LLM ChatGPT can recover public opinion, prompting the LLM to adopt different “personas” and then provide feeling thermometer scores for 11 sociopolitical groups. The average scores generated by ChatGPT correspond closely to the averages in our baseline survey, the 2016–2020 American National Election Study. Nevertheless, sampling by ChatGPT is not reliable for statistical inference: there is less variation in responses than in the real surveys, and regression coefficients often differ significantly from equivalent estimates obtained using ANES data. We also document how the distribution of synthetic responses varies with minor changes in prompt wording, and we show how the same prompt yields significantly different results over a three-month period. Altogether, our findings raise serious concerns about the quality, reliability, and reproducibility of synthetic survey data generated by LLMs.

  6. Synthetic: Canadian Community Health Survey, 2009: Sub-Sample File [Canada]

    • search.dataone.org
    Updated Dec 28, 2023
    Cite
    Health Statistics Division (2023). Synthetic: Canadian Community Health Survey, 2009: Sub-Sample File [Canada] [Dataset]. http://doi.org/10.5683/SP3/30OXE0
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Health Statistics Division
    Description

    Please note: This is a synthetic data file, also known as a Dummy File - it is NOT real data. This synthetic data file should not be used for purposes other than to develop and test computer programs that are to be submitted by remote access. Each record in the synthetic file matches the format and content parameters of the real Statistics Canada Master File with which it is associated, but the data themselves have been 'made up'. They do NOT represent responses from real individuals and should NOT be used for actual analysis. These data are provided solely for the purpose of testing statistical package 'code' (e.g. SPSS syntax, SAS programs, etc.) in preparation for analysis using the associated Master File in a Research Data Centre, by Remote Job Submission, or by some other means of secure access. If statistical analysis 'code' works with the synthetic data, researchers can have some confidence that the same code will run successfully against the Master File data in the Research Data Centres. The Canadian Community Health Survey (CCHS) is a cross-sectional survey that collects information related to health status, health care utilization and health determinants for the Canadian population. Since 2007, the CCHS has operated using continuous collection. It is a large sample, general population health survey, designed to provide reliable estimates at the health region level. In order to provide researchers with a means to access the master file(s), a remote access facility has been implemented. Remote access provides researchers with the possibility to submit computer programs via e-mail to a dedicated address (cchs-escc@statcan.ca), and to receive the results by return e-mail. To obtain remote access privileges, it is necessary that researchers obtain advance approval from the Health Statistics Division.
    Requests must be submitted to the aforementioned e-mail address and must provide the following, clearly itemized information: • the researcher’s affiliation, • the name of all researchers involved in the project, • the title of the research project, • an abstract of the project, • the goals of the research, • the data to which access is required (survey, cycle), • why the project requires access to the master data rather than the PUMF, • why the Remote Access service is chosen rather than on-site access in a Research Data Centre (RDC), • the expected results, and • the project’s expected completion date. Further information is available by contacting the CCHS team at the above e-mail address or by phone at (613) 951-1653. Once the request for remote access has been approved, the researcher can submit his/her computer programs to the CCHS team for processing on the master file(s). The computer output is reviewed by the team for confidentiality concerns and returned to the researcher. However, the correctness and accuracy of each program submission remains, at all times, the sole responsibility of the researcher.

  7. Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data

    • beta.ukdataservice.ac.uk
    Updated 2025
    Cite
    C. Little; M. Elliott; R. Allmendinger (2025). Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data [Dataset]. http://doi.org/10.5255/ukda-sn-9282-1
    Dataset updated
    2025
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Office for National Statistics
    Authors
    C. Little; M. Elliott; R. Allmendinger
    Description

    The aim of this project was to create a synthetic dataset without using the original (secure, controlled) dataset to do so, and instead using only publicly available analytical output (i.e. output that was cleared for publication) to create the synthetic data. Such synthetic data may allow users to gain familiarity with and practise on data that is like the original before they gain access to the original data (where time in a secure setting may be limited).

    The Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data was created without access to the original ASHE-2011 Census dataset (which is only available in a secure setting via the ONS Secure Research Service: "Annual Survey of Hours and Earnings linked to 2011 Census - England and Wales"). It was created as a teaching aid to support a training course "An Introduction to the linked ASHE-2011 Census dataset" organised by Administrative Data Research UK and the National Centre for Research Methods. The synthetic dataset contains a subset of the variables in the original dataset and was designed to reproduce the analytical output contained in the ASHE-Census 2011 Data Linkage User Guide.

  8. Distance of GPT samples to the party-means. The distance of each synthetic...

    • figshare.com
    xls
    Updated May 22, 2025
    Cite
    Fynn Bachmann; Daan van der Weijden; Lucien Heitz; Cristina Sarasua; Abraham Bernstein (2025). Distance of GPT samples to the party-means. The distance of each synthetic sample to the corresponding party-mean is compared to the distance of each candidate to their respective party-mean. The mean and standard deviation of those distributions of distances are averaged across all questions for each party separately. The p-value corresponds to Welch’s t-test with the null hypothesis that GPT samples and candidates have equal distance to the party-mean. [Dataset]. http://doi.org/10.1371/journal.pone.0322690.t002
    Available download formats: xls
    Dataset updated
    May 22, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Fynn Bachmann; Daan van der Weijden; Lucien Heitz; Cristina Sarasua; Abraham Bernstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distance of GPT samples to the party-means. The distance of each synthetic sample to the corresponding party-mean is compared to the distance of each candidate to their respective party-mean. The mean and standard deviation of those distributions of distances are averaged across all questions for each party separately. The p-value corresponds to Welch’s t-test with the null hypothesis that GPT samples and candidates have equal distance to the party-mean.
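
For readers unfamiliar with the test named above, Welch's t-test compares two group means without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom with the Python standard library only; it does not compute a p-value, and the distance values are invented for illustration, not taken from the dataset.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical distances to a party mean for two groups of samples.
gpt_distances = [0.8, 1.1, 0.9, 1.0, 1.2]
candidate_distances = [1.4, 2.0, 1.1, 1.9, 1.6]
t_stat, dof = welch_t(gpt_distances, candidate_distances)
```

A p-value would then come from the t distribution with `dof` degrees of freedom, for which a statistics library is normally used.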

  9. Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data -...

    • b2find.eudat.eu
    Updated Nov 9, 2024
    Cite
    (2024). Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/bb3fbfa2-6888-5657-8f8e-4ed9877d24e3
    Dataset updated
    Nov 9, 2024
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Abstract copyright UK Data Service and data collection copyright owner. The aim of this project was to create a synthetic dataset without using the original (secure, controlled) dataset to do so, and instead using only publicly available analytical output (i.e. output that was cleared for publication) to create the synthetic data. Such synthetic data may allow users to gain familiarity with and practise on data that is like the original before they gain access to the original data (where time in a secure setting may be limited). The Annual Survey of Hours and Earnings 2011 and Census 2011: Synthetic Data was created without access to the original ASHE-2011 Census dataset (which is only available in a secure setting via the ONS Secure Research Service: "Annual Survey of Hours and Earnings linked to 2011 Census - England and Wales"). It was created as a teaching aid to support a training course "An Introduction to the linked ASHE-2011 Census dataset" organised by Administrative Data Research UK and the National Centre for Research Methods. The synthetic dataset contains a subset of the variables in the original dataset and was designed to reproduce the analytical output contained in the ASHE-Census 2011 Data Linkage User Guide. Main Topics: Variables available in this study relate to synthetic employment, earnings and demographic information for adults employed in England and Wales in 2011. Synthetic sample generated by a computer algorithm Compilation/Synthesis

  10. SIPHER Synthetic Population for Individuals in Great Britain, 2019-2021

    • beta.ukdataservice.ac.uk
    Updated 2025
    Cite
    N. Lomax; A. Hoehn; A. Heppenstall; R. Purshouse; G. Wu; K. Zia; P. Meier (2025). SIPHER Synthetic Population for Individuals in Great Britain, 2019-2021 [Dataset]. http://doi.org/10.5255/ukda-sn-9277-1
    Dataset updated
    2025
    Dataset provided by
    datacite
    University of Essex, Institute for Social and Economic Research
    Authors
    N. Lomax; A. Hoehn; A. Heppenstall; R. Purshouse; G. Wu; K. Zia; P. Meier
    Area covered
    Great Britain, United Kingdom
    Description
    The SIPHER Synthetic Population allows for the creation of a survey-based full-scale synthetic population for all of Great Britain, through a linkage with the UK Household Longitudinal Study (UKDS SN 6614, Understanding Society, wave k). By drawing on data reflecting 'real' survey respondents, the dataset represents over 50 million synthetic (i.e. 'not real') individuals. As a digital twin of the adult population in Great Britain, the SIPHER Synthetic Population provides a novel source of microdata for understanding 'status quo' and modelling 'what if' scenarios (e.g., via static/dynamic microsimulation model), as well as other exploratory analyses where a granular geographical resolution is required.

    The lack of a centralised and comprehensive register-based system in Great Britain limits opportunities for studying the interaction of aspects such as health, employment, benefit payments, or housing quality at the level of individuals and households. At the same time, the data that exist are typically strictly controlled and only available in safe haven environments under a 'create-and-destroy' model. In particular, when testing policy options via simulation models where results are required swiftly, these limitations can present major hurdles to coproduction and collaborative work connecting researchers, policymakers, and key stakeholders. In some cases, survey data can provide a suitable alternative to the lack of readily available administrative data. However, survey data typically do not allow for a small-area perspective. Although Special Licence area-level linkages of survey data can offer more detailed spatial information, the data coverage and statistical power might be too low for meaningful analysis.

    As the SIPHER Synthetic Population is the outcome of a statistical creation process, all results obtained from this dataset should always be treated as 'model output', including basic descriptive statistics. Here, the SIPHER Synthetic Population should not replace the underlying Understanding Society survey data for standard statistical analyses (e.g., standard regression analysis, or longitudinal multi-wave analysis). Please see the User Guide provided for this dataset for further information on creation and validation.

    This research was conducted as part of the Systems Science in Public Health and Health Economics Research (SIPHER) Consortium (https://www.gla.ac.uk/research/az/sipher/), and we thank the whole team for valuable input and discussions which have informed this work.

  11. Popstan Synthetic Household Survey 2023 - Popstan

    • nada-demo.ihsn.org
    Updated Feb 12, 2025
    Cite
    National Statistics Office (NSO) (2025). Popstan Synthetic Household Survey 2023 - Popstan [Dataset]. https://nada-demo.ihsn.org/index.php/catalog/139
    Dataset updated
    Feb 12, 2025
    Dataset authored and provided by
    National Statistics Office (NSO)
    Time period covered
    2023
    Area covered
    Popstan
    Description

    Abstract

    The 2023 Popstan Synthetic Household Survey is a periodic national welfare monitoring survey conducted by the Popstan National Statistics Office. It is used to update the national poverty profile, based on the (indexed) national poverty line calculated in 2017.

    Geographic coverage

    National (all 10 regions)

    Analysis unit

    Household, Individual

    Universe

    Resident population with exception of homeless, nomads, and residents in institutional households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A stratified sample was drawn. The urban/rural areas of each province (geo1) were used as strata. The sample of 8,000 households was selected proportional to the size of each geo1. In each stratum, we randomly selected enumeration areas (EAs), and in each EA we randomly

    Response rate

    The response rate was 100%.

  12. Synthetic population housing and person records for the United States

    • zenodo.org
    • openicpsr.org
    • +2 more
    zip
    Updated Aug 22, 2023
    Cite
    William Sexton; John M. Abowd; Ian M. Schmutte; Lars Vilhuber (2023). Synthetic population housing and person records for the United States [Dataset]. http://doi.org/10.5281/zenodo.556121
    Available download formats: zip
    Dataset updated
    Aug 22, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    William Sexton; John M. Abowd; Ian M. Schmutte; Lars Vilhuber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    The synthetic population was generated from the 2010-2014 ACS PUMS housing and person files.

    United States Department of Commerce. Bureau of the Census. (2017-03-06).
    American Community Survey 2010-2014 ACS 5-Year PUMS File [Data set].
    Ann Arbor, MI: Inter-university Consortium for Political and Social
    Research [distributor]. http://doi.org/10.3886/E100486V1

    Outputs

    There are 17 housing files
    - repHus0.csv, repHus1.csv, ... repHus16.csv
    and 32 person files
    - rep_recode_ACSpus0.csv, rep_recode_ACSpus1.csv, ... rep_recode_ACSpus31.csv.

    Files are split to be roughly equal in size. The files contain data for the entire country. Files are not split along any demographic characteristic. The person files and housing files must be concatenated to form a complete person file and a complete housing file, respectively.

    If desired, person and housing records should be merged on 'id'. Variable description is below.

    Data Dictionary
    See [2010-2014 ACS PUMS data dictionary](http://doi.org/10.3886/E100486V1). All variables from the ACS PUMS housing files are present in the synthetic housing files and all variables from the ACS PUMS person files are present in the synthetic person files. Variables have not been modified in any way. Theoretically, variables like `person weight` no longer have any use in the synthetic population.

    See README.md for more details.
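
The concatenate-then-merge steps described under Outputs could be sketched as follows, using only the Python standard library. The file names follow the pattern given in the description. The function names, the `hu_` prefix for housing columns, and the assumption that 'id' appears in both file types and is unique per housing record are illustrative only; README.md remains the authority on the actual layout.

```python
import csv
from pathlib import Path

# File names as listed in the description: 17 housing files, 32 person files.
HOUSING_FILES = [f"repHus{i}.csv" for i in range(17)]
PERSON_FILES = [f"rep_recode_ACSpus{i}.csv" for i in range(32)]


def read_concatenated(directory, file_names):
    """Read several CSV parts and return all their rows as one list of dicts."""
    rows = []
    for name in file_names:
        with open(Path(directory) / name, newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


def merge_on_id(persons, housing):
    """Attach housing columns to each person row sharing the same 'id'.

    Assumes 'id' is unique among housing records. Housing columns get a
    'hu_' prefix so they cannot overwrite person columns of the same name.
    Person rows with no matching housing record are dropped.
    """
    housing_by_id = {h["id"]: h for h in housing}
    merged = []
    for person in persons:
        unit = housing_by_id.get(person["id"])
        if unit is not None:
            extra = {f"hu_{k}": v for k, v in unit.items() if k != "id"}
            merged.append({**extra, **person})
    return merged
```

Because the full files are large, a real workflow would more likely stream the rows or use a dataframe library than hold everything in lists; the sketch only shows the order of operations.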

  13. Survey data file with analysis.

    • plos.figshare.com
    xlsx
    Updated Nov 20, 2024
    Cite
    Jennifer K. Wagner; Laura Y. Cabrera; Sara Gerke; Daniel Susser (2024). Survey data file with analysis. [Dataset]. http://doi.org/10.1371/journal.pdig.0000666.s002
    Available download formats: xlsx
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Jennifer K. Wagner; Laura Y. Cabrera; Sara Gerke; Daniel Susser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artificial intelligence (AI) and machine learning (ML) tools are now proliferating in biomedical contexts, and there is no sign this will slow down any time soon. AI/ML and related technologies promise to improve scientific understanding of health and disease and have the potential to spur the development of innovative and effective diagnostics, treatments, cures, and medical technologies. Concerns about AI/ML are prominent, but two specific aspects of AI/ML have so far received little research attention: synthetic data and computational checklists that might promote not only the reproducibility of AI/ML tools but also increased attention to ethical, legal, and social implications (ELSI) of AI/ML tools. We administered a targeted survey to explore these two items among biomedical professionals in the United States. Our survey findings suggest that there is a gap in familiarity with both synthetic data and computational checklists among AI/ML users and developers and those in ethics-related positions who might be tasked with ensuring the proper use or oversight of AI/ML tools. The findings from this survey study underscore the need for additional ELSI research on synthetic data and computational checklists to inform escalating efforts, including the establishment of laws and policies, to ensure safe, effective, and ethical use of AI in health settings.

  14. France Consumer Survey: sa: Consumer Synthetic Index

    • ceicdata.com
    Updated Feb 15, 2025
    CEICdata.com (2025). France Consumer Survey: sa: Consumer Synthetic Index [Dataset]. https://www.ceicdata.com/en/france/consumer-survey/consumer-survey-sa-consumer-synthetic-index
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 1, 2024 - Feb 1, 2025
    Area covered
    France
    Variables measured
    Consumer Survey
    Description

    France Consumer Survey: sa: Consumer Synthetic Index data was reported at 91.517 % Point in Mar 2025. This records a decrease from the previous number of 93.461 % Point for Feb 2025. France Consumer Survey: sa: Consumer Synthetic Index data is updated monthly, averaging 100.308 % Point from Jan 1987 (Median) to Mar 2025, with 459 observations. The data reached an all-time high of 126.236 % Point in Jan 2001 and a record low of 79.867 % Point in Jun 2013. France Consumer Survey: sa: Consumer Synthetic Index data remains in active status in CEIC and is reported by the National Institute of Statistics and Economic Studies. The data is categorized under Global Database’s France – Table FR.H032: Consumer Survey. [COVID-19-IMPACT]

  15. Synthetic Dataset of Crimes in England and Wales

    • beta.ukdataservice.ac.uk
    Updated 2024
    UK Data Service (2024). Synthetic Dataset of Crimes in England and Wales [Dataset]. http://doi.org/10.5255/ukda-sn-857314
    Dataset updated
    2024
    Dataset provided by
    UK Data Service https://ukdataservice.ac.uk/
    datacite
    Area covered
    England, Wales
    Description

    This resource includes the code needed to generate a synthetic dataset of all crimes that occurred in each output area in England and Wales in 2011. Counts of violence, property crime and criminal damage can be generated, and three different approaches to counting crime are possible: synthetic data of all crimes, synthetic data of police-recorded crimes, and synthetic data of survey-estimated crimes.

    Once the crime counts have been generated at output area level, they can be aggregated to any spatial scale of interest.

    Crime counts are synthesised by predicting individual victimisation propensities using the Crime Survey for England and Wales (2011), then mapping these propensities onto individuals (and households) based on population counts from the UK census.
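The aggregation step described above can be illustrated with a minimal Python sketch. It assumes the output-area counts are already available as a dictionary and that a lookup from output area to a larger area exists; the function name `aggregate_counts`, the area codes, and the lookup are hypothetical and are not taken from the deposited code.

```python
from collections import Counter


def aggregate_counts(oa_counts, oa_to_area):
    """Sum output-area crime counts into larger areas.

    oa_counts:  {output_area_code: crime_count}
    oa_to_area: {output_area_code: larger_area_code} (hypothetical lookup)
    """
    totals = Counter()
    for oa, count in oa_counts.items():
        totals[oa_to_area[oa]] += count
    return dict(totals)


# Illustrative codes only; real output-area codes look different.
example_counts = {"OA1": 2, "OA2": 3, "OA3": 1}
example_lookup = {"OA1": "X", "OA2": "X", "OA3": "Y"}
aggregated = aggregate_counts(example_counts, example_lookup)
```

Because counts are simply summed, the same function works for any target geography for which a lookup table is available.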

  16. Synthetic Dataset of Crimes in England and Wales - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Nov 10, 2020
    (2020). Synthetic Dataset of Crimes in England and Wales - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/fa8ea27a-6ccc-5bdc-b8cd-e68f307f8994
    Dataset updated
    Nov 10, 2020
    Area covered
    England, Wales
    Description

    Data are synthetic. The following steps were followed to generate a synthetic dataset of crimes in England and Wales:

    1. Download Census data aggregates at the Output Area level under an Open Government Licence.
    2. Download microdata of the Crime Survey for England and Wales (CSEW) 2011/12 from the UK Data Service.
    3. Generate a synthetic population of residents (or households) in Output Areas based on empirical parameters observed in Census data and the covariance matrix observed in the CSEW.
    4. Based on parameters from the CSEW 2011/12, generate crimes (violence, property crime and damage) reported within each unit in the synthetic population.
    5. Based on parameters from the CSEW 2011/12, predict whether each crime generated in Step 4 is known to, and recorded by, the police (this will be the synthetic dataset of police-recorded crimes).
    6. Draw a random sample of units from the synthetic population following the sampling design of the CSEW (this will be the synthetic dataset of crimes recorded by the CSEW).

    This generates three sets of synthetic crime data, which can then be compared at different spatial scales:

    i) 'synthetic_population_crimes.RData': synthetic data of all crime, split in 7 files (generated in Step 4)
    ii) 'synthetic_police_crimes.RData': synthetic data of police-recorded crime (generated in Step 5)
    iii) 'synthetic_survey_crimes.RData': synthetic data of survey-recorded crime (generated in Step 6)
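Steps 3 to 6 can be illustrated with a toy Python sketch. This is not the deposited code (which produces .RData files); every parameter here (`crime_rate`, `p_recorded`, `sample_frac`, the 20 output areas) is a made-up placeholder rather than a CSEW estimate, and a simple random sample stands in for the actual CSEW sampling design.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate(n_units=1000, crime_rate=0.3, p_recorded=0.4,
             sample_frac=0.1, seed=0):
    rng = random.Random(seed)
    # Step 3 (stand-in): a synthetic population spread over 20 output areas.
    population = [{"unit": i, "output_area": i % 20} for i in range(n_units)]
    # Step 4: generate a crime count per unit (hypothetical Poisson rate).
    all_crimes = [
        {"unit": u["unit"], "output_area": u["output_area"]}
        for u in population
        for _ in range(poisson(rng, crime_rate))
    ]
    # Step 5: flag which crimes are recorded by the police (hypothetical rate).
    police_crimes = [c for c in all_crimes if rng.random() < p_recorded]
    # Step 6: sample units; keep the crimes of sampled units (simple random
    # sampling is an assumption, not the CSEW design).
    sampled = set(rng.sample(range(n_units), int(n_units * sample_frac)))
    survey_crimes = [c for c in all_crimes if c["unit"] in sampled]
    return all_crimes, police_crimes, survey_crimes


all_crimes, police_crimes, survey_crimes = simulate()
```

The three returned lists play the role of the three output files: all crimes, police-recorded crimes, and survey-recorded crimes.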

  17. Estimating census health geographies: using synthetic estimation with...

    • eprints.soton.ac.uk
    Updated Jan 15, 2025
    Moon, Graham; Twigg, Liz; Taylor, Joanna; Aitken, Grant (2025). Estimating census health geographies: using synthetic estimation with secondary survey and census data [Dataset]. http://doi.org/10.5255/UKDA-SN-851972
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    UK Data Archive http://data-archive.ac.uk/
    Authors
    Moon, Graham; Twigg, Liz; Taylor, Joanna; Aitken, Grant
    Description

    Small area estimates of self-assessed health and limiting long-term illness developed using multilevel small area estimation methodologies. Estimates are for varying combinations of England, Wales and Scotland and are at the middle layer super output area (or equivalent). Comparisons with 2011 census data are facilitated. Estimates were developed using the Health Surveys for England, Wales and Scotland, and using the Crime Survey for England and Wales. Publications and working papers to date are available from the Project website (see 'additional resources').

  18. 3x2pt Synthetic Data Vectors

    • zenodo.org
    Updated Oct 14, 2023
    Davide Sciotti; Davide Sciotti (2023). 3x2pt Synthetic Data Vectors [Dataset]. http://doi.org/10.5281/zenodo.10002142
    Dataset updated
    Oct 14, 2023
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Davide Sciotti; Davide Sciotti
    Description

    3x2pt synthetic data generated according to the IST:L recipe for CLOE v2.0

    Cosmological and nuisance parameters and survey specifics as detailed here

    Files in the ASCII .dat format are the ones read by CLOE

    Files in the binary .npy format are for additional tests

  19. G-2945: Synthetic Seismogram Data for Correlation Between Seismic-Reflection...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Jul 6, 2024
    U.S. Geological Survey (2024). G-2945: Synthetic Seismogram Data for Correlation Between Seismic-Reflection Profiles and Well Data, Broward County, Florida [Dataset]. https://catalog.data.gov/dataset/g-2945-synthetic-seismogram-data-for-correlation-between-seismic-reflection-profiles-and-w
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey http://www.usgs.gov/
    Area covered
    Broward County, Florida
    Description

    The data set consists of time, depth, reflection coefficient synthetic, sonic velocity, density, and amplitude used to create a synthetic seismogram for Water Treatment Plant RO, G-2945, (DZMW-1) in Broward County, Florida.

  20. Many Models in R: A Tutorial - National Child Development Study: Age 46,...

    • b2find.eudat.eu
    Updated Apr 7, 2024
    (2024). Many Models in R: A Tutorial - National Child Development Study: Age 46, Sweep 7, 2004-2005: Synthetic Data, 2023 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/6a73a99c-af03-5b85-8dac-44d022221f15
    Dataset updated
    Apr 7, 2024
    Description

    The deposit contains a dataset created for the paper, 'Many Models in R: A Tutorial'. ncds.Rds is an R format synthetic dataset created with the synthpop package in R using data from the National Child Development Study (NCDS), a birth cohort of individuals born in a single week of March 1958 in Britain. The dataset contains data on fourteen biomarkers collected at the age 46/47 sweep of the survey, four measures of cognitive ability from ages 11 and 16, and three covariates: sex, body mass index at age 11 and father's social class. The data is only intended to be used in the tutorial; it is not to be used for drawing statistical inferences.

    This project contains data used in the paper, "Many Models in R: A Tutorial". The data are a simplified, synthetic and imputed version of the National Child Development Study. There are variables for 14 biomarkers from the age 46/47 biomedical survey, 4 measures of cognitive ability from tests at ages 11 and 16, and 3 covariates (sex, father's socioeconomic class and BMI at age 11). The data were originally collected by interview and nurse assessment. For information about the creation of the synthetic data please check "Data sourcing, processing and preparation" and the user guide.

Development Data Group, Data Analytics Unit (2024). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://nada-demo.ihsn.org/index.php/catalog/135

Synthetic Data for an Imaginary Country, Sample, 2023 - World

5 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Nov 1, 2024
Dataset authored and provided by
Development Data Group, Data Analytics Unit
Time period covered
2023
Area covered
World
Description

Abstract

The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

The full-population dataset (with about 10 million individuals) is also distributed as open data.

Geographic coverage

The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

Analysis unit

Household, Individual

Universe

The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

Kind of data

ssd

Sampling procedure

The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
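The two-stage design described above can be sketched in Python as follows. This is only an illustration and not the R script distributed with the dataset: the function names, the largest-remainder rounding, the use of household counts as the stratum size measure, and simple random selection of enumeration areas within each stratum are all assumptions made for the example.

```python
import random
from collections import defaultdict


def allocate_eas(stratum_sizes, n_eas):
    """Allocate n_eas across strata proportionally to stratum size,
    using largest-remainder rounding so the allocation sums to n_eas."""
    total = sum(stratum_sizes.values())
    quotas = {s: n_eas * size / total for s, size in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}
    leftover = n_eas - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc


def draw_sample(frame, sample_size=8000, per_ea=25, seed=0):
    """frame: iterable of (stratum, ea_id, household_id) tuples.
    Assumes each stratum has enough EAs and each EA has at least per_ea households."""
    rng = random.Random(seed)
    eas_by_stratum = defaultdict(lambda: defaultdict(list))
    for stratum, ea, household in frame:
        eas_by_stratum[stratum][ea].append(household)
    # Stratum size measured in households (an assumption for this sketch).
    sizes = {s: sum(len(h) for h in eas.values())
             for s, eas in eas_by_stratum.items()}
    alloc = allocate_eas(sizes, sample_size // per_ea)
    sample = []
    for stratum, eas in eas_by_stratum.items():
        # Stage 1: select enumeration areas within the stratum.
        for ea in rng.sample(sorted(eas), alloc[stratum]):
            # Stage 2: select a fixed number of households within the EA.
            sample.extend(rng.sample(eas[ea], per_ea))
    return sample
```

With the defaults, 8,000 households at 25 per enumeration area implies 320 enumeration areas in total, which `allocate_eas` distributes across the strata.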

Mode of data collection

other

Research instrument

The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

Cleaning operations

The synthetic data generation process included a set of "validators" (consistency checks against which synthetic observations were assessed and rejected or replaced when needed). In addition, some post-processing was applied to the data to produce the distributed data files.
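A validator-based reject-and-replace loop of the kind described above might look like the following Python sketch. The two checks, the field names `age` and `marital_status`, and the toy generator are hypothetical; the actual validators used with REaLTabFormer are not documented on this page.

```python
import random


def age_in_range(record):
    # Hypothetical check: age must be plausible.
    return 0 <= record["age"] <= 110


def child_not_married(record):
    # Hypothetical check: young children cannot be recorded as married.
    return not (record["age"] < 12 and record["marital_status"] == "married")


VALIDATORS = [age_in_range, child_not_married]


def is_valid(record, validators=VALIDATORS):
    """A record is accepted only if it passes every consistency check."""
    return all(check(record) for check in validators)


def toy_generator(rng):
    # Stand-in for a generative model; deliberately emits some bad records.
    return {
        "age": rng.randint(-5, 120),
        "marital_status": rng.choice(["single", "married"]),
    }


def generate_valid(n, rng, max_tries=1000):
    """Draw candidates until n valid records are collected,
    replacing each rejected candidate with a fresh draw."""
    records = []
    for _ in range(n):
        for _ in range(max_tries):
            candidate = toy_generator(rng)
            if is_valid(candidate):
                records.append(candidate)
                break
        else:
            raise RuntimeError("no valid record found; validators may be too strict")
    return records


records = generate_valid(50, random.Random(1))
```

Keeping each check as a small separate function makes it straightforward to add or remove rules without touching the generation loop.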

Response rate

This is a synthetic dataset; the "response rate" is 100%.
