100+ datasets found
  1. Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
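
    The three fidelity checks named in the Methods (two-sample t-test, two-sample proportion test, 95% CI overlap) can be sketched in plain Python. This is an illustrative sketch only, not the study's code: the Welch (unequal-variance) form of the t statistic, the normal-approximation CI, the pooled z statistic, and the toy values are all assumptions made here for demonstration.

```python
import math
from statistics import mean, stdev


def welch_t(a, b):
    """Welch's two-sample t statistic (assumes unequal variances)."""
    var_a = stdev(a) ** 2 / len(a)
    var_b = stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / math.sqrt(var_a + var_b)


def ci95(sample):
    """Normal-approximation 95% confidence interval for the sample mean."""
    half_width = 1.96 * stdev(sample) / math.sqrt(len(sample))
    centre = mean(sample)
    return (centre - half_width, centre + half_width)


def intervals_overlap(ci_a, ci_b):
    """True when two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z statistic for a difference in proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / std_err


# Hypothetical toy values, NOT taken from VitalDB or the study.
real_age = [54, 61, 47, 70, 58, 66, 52, 63]
synthetic_age = [55, 60, 49, 68, 59, 64, 53, 62]

print("t statistic:", welch_t(real_age, synthetic_age))
print("CIs overlap:", intervals_overlap(ci95(real_age), ci95(synthetic_age)))
print("z statistic:", two_proportion_z(40, 100, 44, 100))
```

    A continuous parameter would count as statistically similar when the t statistic is small relative to its critical value; a binary parameter would be compared with the z statistic in the same way.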

  2. Dataset of article: Synthetic Datasets Generator for Testing Information...

    • ieee-dataport.org
    Updated Mar 13, 2020
    + more versions
    Cite
    Sandro Mendonça (2020). Dataset of article: Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools [Dataset]. http://doi.org/10.21227/5aeq-rr34
    Explore at:
    Dataset updated
    Mar 13, 2020
    Dataset provided by
    IEEE Dataport
    Authors
    Sandro Mendonça
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used in the article entitled 'Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools'. These datasets can be used to test several characteristics in machine learning and data processing algorithms.

  3. Table1_Enhancing biomechanical machine learning with limited data:...

    • frontiersin.figshare.com
    pdf
    Updated Feb 14, 2024
    Cite
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich (2024). Table1_Enhancing biomechanical machine learning with limited data: generating realistic synthetic posture data using generative artificial intelligence.pdf [Dataset]. http://doi.org/10.3389/fbioe.2024.1350135.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Frontiers
    Authors
    Carlo Dindorf; Jonas Dully; Jürgen Konradi; Claudia Wolf; Stephan Becker; Steven Simon; Janine Huthwelker; Frederike Werthmann; Johanna Kniepert; Philipp Drees; Ulrich Betz; Michael Fröhlich
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data.

    Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation.

    Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples.

    Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.

  4. Synthetic Data Generation Market Analysis North America, Europe, APAC,...

    • technavio.com
    Cite
    Synthetic Data Generation Market Analysis North America, Europe, APAC, Middle East and Africa, South America - US, China, Germany, UK, Japan - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/synthetic-data-generation-market-analysis
    Explore at:
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Global
    Description


    Synthetic Data Generation Market Size 2024-2028

    The synthetic data generation market size is forecast to increase by USD 2.88 billion at a CAGR of 60.02% between 2023 and 2028.

    The global synthetic data generation market is expanding steadily, driven by the growing need for privacy-compliant data solutions and advancements in AI technology. Key factors include the increasing demand for data to train machine learning models, particularly in industries like healthcare services and finance where privacy regulations are strict and the use of predictive analytics is critical, and the use of generative AI and machine learning algorithms, which create high-quality synthetic datasets that mimic real-world data without compromising security.
    This report provides a detailed analysis of the global synthetic data generation market, covering market size, growth forecasts, and key segments such as agent-based modeling and data synthesis. It offers practical insights for business strategy, technology adoption, and compliance planning. A significant trend highlighted is the rise of synthetic data in AI training, enabling faster and more ethical development of models. One major challenge addressed is the difficulty in ensuring data quality, as poorly generated synthetic data can lead to inaccurate outcomes.
    For businesses aiming to stay competitive in a data-driven global landscape, this report delivers essential data and strategies to leverage synthetic data trends and address quality challenges, ensuring they remain leaders in innovation while meeting regulatory demands.
    

    What will be the Size of the Market During the Forecast Period?


    Synthetic data generation offers a more time-efficient solution compared to traditional methods of data collection and labeling, making it an attractive option for businesses looking to accelerate their AI and machine learning projects. The market represents a promising opportunity for organizations seeking to overcome the challenges of data scarcity and privacy concerns while maintaining data diversity and improving the efficiency of their artificial intelligence and machine learning initiatives. By leveraging this technology, technology decision-makers can drive innovation and gain a competitive edge in their respective industries.

    Market Segmentation

    The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

    End-user
      Healthcare and life sciences
      Retail and e-commerce
      Transportation and logistics
      IT and telecommunication
      BFSI and others

    Type
      Agent-based modelling
      Direct modelling

    Data
      Tabular Data
      Text Data
      Image & Video Data
      Others

    Offering Band
      Fully Synthetic Data
      Partially Synthetic Data
      Hybrid Synthetic Data

    Application
      Data Protection
      Data Sharing
      Predictive Analytics
      Natural Language Processing
      Computer Vision Algorithms
      Others

    Geography
      North America
        US
        Canada
        Mexico
      Europe
        Germany
        UK
        France
        Italy
      APAC
        China
        Japan
        India
      Middle East and Africa
      South America

    By End-user Insights

    The healthcare and life sciences segment is estimated to witness significant growth during the forecast period. In the thriving healthcare and life sciences sector, synthetic data generation is gaining significant traction as a cost-effective and time-efficient alternative to utilizing real-world data. This market segment's rapid expansion is driven by the increasing demand for data-driven insights and the importance of safeguarding sensitive information. One noteworthy application of synthetic data generation is in the realm of computer vision, specifically with geospatial imagery and medical imaging.

    For instance, in healthcare, synthetic data can be generated to replicate medical imaging, such as MRI scans and X-rays, for research and machine learning model development without compromising patient privacy. Similarly, in the field of physical security, synthetic data can be employed to enhance autonomous vehicle simulation, ensuring optimal performance and safety without the need for real-world data. By generating artificial datasets, organizations can diversify their data sources and improve the overall quality and accuracy of their machine learning models.


    The healthcare and life sciences segment was valued at USD 12.60 million in 2018 and showed a gradual increase during the forecast period.

    Regional Insights

    North America is estimated to contribute 36% to the growth of the global market during the forecast period. Technavio's analysts have elaborately explained the regional trends and drivers that shape the m

  5. Synthetic Data Generation Market Size, Share, Trends & Insights Report, 2035...

    • rootsanalysis.com
    Updated Oct 1, 2024
    Cite
    Roots Analysis (2024). Synthetic Data Generation Market Size, Share, Trends & Insights Report, 2035 [Dataset]. https://www.rootsanalysis.com/synthetic-data-generation-market
    Explore at:
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Authors
    Roots Analysis
    License

    https://www.rootsanalysis.com/privacy.html

    Time period covered
    2021 - 2031
    Area covered
    Global
    Description

    The global synthetic data market size is projected to grow from USD 0.4 billion in the current year to USD 19.22 billion by 2035, representing a CAGR of 42.14% during the forecast period.

  6. Synthetic Document Dataset for AI - Jpeg, PNG & PDF formats

    • datarade.ai
    Updated Sep 17, 2022
    Cite
    Ainnotate (2022). Synthetic Document Dataset for AI - Jpeg, PNG & PDF formats [Dataset]. https://datarade.ai/data-products/synthetic-document-dataset-for-ai-jpeg-png-pdf-formats-ainnotate
    Explore at:
    Dataset updated
    Sep 17, 2022
    Dataset authored and provided by
    Ainnotate
    Area covered
    Korea (Democratic People's Republic of), Tonga, Cabo Verde, Tokelau, Denmark, Ireland, Germany, Brazil, Syrian Arab Republic, Canada
    Description

    Ainnotate’s proprietary dataset generation methodology based on large scale generative modelling and Domain randomization provides data that is well balanced with consistent sampling, accommodating rare events, so that it can enable superior simulation and training of your models.

    Ainnotate currently provides synthetic datasets in the following domains and use cases.

    Internal Services - Visa application, Passport validation, License validation, Birth certificates
    Financial Services - Bank checks, Bank statements, Pay slips, Invoices, Tax forms, Insurance claims and Mortgage/Loan forms
    Healthcare - Medical ID cards

  7. Synthetic Data Generation Market to Surpass USD 6,637.98 Mn By 2034

    • scoop.market.us
    Updated Mar 18, 2025
    + more versions
    Cite
    Market.us Scoop (2025). Synthetic Data Generation Market to Surpass USD 6,637.98 Mn By 2034 [Dataset]. https://scoop.market.us/synthetic-data-generation-market-news/
    Explore at:
    Dataset updated
    Mar 18, 2025
    Dataset authored and provided by
    Market.us Scoop
    License

    https://scoop.market.us/privacy-policy

    Time period covered
    2022 - 2032
    Area covered
    Global
    Description

    Synthetic Data Generation Market Size

    As per the latest insights from Market.us, the Global Synthetic Data Generation Market is set to reach USD 6,637.98 million by 2034, expanding at a CAGR of 35.7% from 2025 to 2034. The market, valued at USD 313.50 million in 2024, is witnessing rapid growth due to rising demand for high-quality, privacy-compliant, and AI-driven data solutions.

    North America dominated in 2024, securing over 35% of the market, with revenues surpassing USD 109.7 million. The region’s leadership is fueled by strong investments in artificial intelligence, machine learning, and data security across industries such as healthcare, finance, and autonomous systems. With increasing reliance on synthetic data to enhance AI model training and reduce data privacy risks, the market is poised for significant expansion in the coming years.
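
    The headline figures in this description can be cross-checked with the standard compound annual growth rate formula. The sketch below is illustrative and assumes ten compounding periods between the 2024 base value and the 2034 target; the function names are hypothetical.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by a start value, end value, and horizon."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value reached after compounding `rate` annually for `years` periods."""
    return start_value * (1 + rate) ** years


# Figures quoted in the description above (USD million).
base_2024 = 313.50
target_2034 = 6637.98

# Assumption: 10 annual periods separate the 2024 base from the 2034 target.
implied_rate = cagr(base_2024, target_2034, 10)
print(f"Implied CAGR: {implied_rate:.1%}")
print(f"Projected 2034 value at 35.7%: {project(base_2024, 0.357, 10):,.2f}")
```

    Under that assumption the implied rate agrees with the quoted 35.7% to one decimal place.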

  8. Synthetic Data Solution Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 12, 2025
    Cite
    AMA Research & Media LLP (2025). Synthetic Data Solution Report [Dataset]. https://www.archivemarketresearch.com/reports/synthetic-data-solution-21817
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    AMA Research & Media
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Market Analysis for Synthetic Data Solution

    The global synthetic data solution market is projected to reach USD XXX million by 2033, growing at a CAGR of XX% from 2025 to 2033. The increasing demand for synthetic data in various industries, such as financial services, retail, and healthcare, drives this growth. Synthetic data offers a privacy-preserving alternative to real-world data, enabling organizations to train and evaluate models without compromising sensitive information. The growing adoption of cloud-based solutions and the increasing need for data privacy and security further contribute to market growth.

    Market segments include deployment types (cloud-based and on-premises) and applications (financial services industry, retail industry, medical industry, and others). Key regional markets include North America, South America, Europe, Middle East & Africa, and Asia Pacific. Major companies operating in the market include LightWheel AI, Hanyi Innovation Technology, Haohan Data Technology, Haitian Ruisheng Science Technology, and Baidu. Trends such as the adoption of artificial intelligence (AI) and machine learning (ML) and the rising concern over data privacy and governance are expected to shape the market's future.

  9. Synthetic Data Generation Market Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Dec 8, 2024
    Cite
    Market Research Forecast (2024). Synthetic Data Generation Market Report [Dataset]. https://www.marketresearchforecast.com/reports/synthetic-data-generation-market-1834
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Dec 8, 2024
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The synthetic data generation market size was valued at USD 288.5 million in 2023 and is projected to reach USD 1,920.28 million by 2032, exhibiting a CAGR of 31.1% during the forecast period.

    Synthetic data generation is the creation of artificial datasets that resemble real datasets in their distribution and patterns. The data points are produced by algorithms or models rather than by observations or surveys. A core advantage is that synthetic data can maintain the statistical characteristics of the original data while removing the privacy risk of using real data. Further, there is no limit on how much synthetic data can be created, so it can be used for extensive testing and training of machine learning models, unlike conventional data, which may be highly regulated or limited in availability. It also helps generate comprehensive datasets that include many examples of specific situations or contexts that may occur in practice, improving the AI system’s performance. Synthetic data generation (SDG) significantly shortens the development cycle, requiring less time and effort for data collection and annotation. It allows researchers and developers to work efficiently in specific domains such as healthcare and finance.

    Key drivers for this market are: Growing Demand for Data Privacy and Security to Fuel Market Growth. Potential restraints include: Lack of Data Accuracy and Realism Hinders Market Growth. Notable trends are: Growing Implementation of Touch-based and Voice-based Infotainment Systems to Increase Adoption of Intelligent Cars.

  10. Distribution of data used when developing AI products South Korea 2023

    • statista.com
    Updated Sep 19, 2024
    Cite
    Statista (2024). Distribution of data used when developing AI products South Korea 2023 [Dataset]. https://www.statista.com/statistics/1452827/south-korea-share-of-data-used-when-developing-artificial-intelligence-products/
    Explore at:
    Dataset updated
    Sep 19, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Sep 2023 - Nov 2023
    Area covered
    South Korea
    Description

    According to a survey of artificial intelligence (AI) companies in South Korea carried out in 2023, nearly 66 percent of the data used when developing AI products and services was private data. On the other hand, public data comprised around 34 percent.

  11. Training dataset for "A deep learned nanowire segmentation model using...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 16, 2024
    Cite
    David, A. Santos (2024). Training dataset for "A deep learned nanowire segmentation model using synthetic data augmentation" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6469772
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Nima, Emami
    Yuting, Luo
    Bai-Xiang, Xu
    Lin, Binbin
    Sarbajit, Banerjee
    David, A. Santos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This image dataset contains synthetic structure images used for training the deep-learning based nanowire segmentation model presented in our work "A deep learned nanowire segmentation model using synthetic data augmentation" to be published in npj Computational materials. Detailed information can be found in the corresponding article.

  12. Synthetic Data Tool Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 21, 2025
    Cite
    Archive Market Research (2025). Synthetic Data Tool Report [Dataset]. https://www.archivemarketresearch.com/reports/synthetic-data-tool-38973
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Feb 21, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global synthetic data tool market is projected to reach USD 10,394.0 million by 2033, exhibiting a CAGR of 34.8% during the forecast period. The growing adoption of AI and ML technologies, increasing demand for data privacy and security, and the rising need for data for training and testing machine learning models are the key factors driving market growth. Additionally, the availability of open-source synthetic data generation tools and the increasing adoption of cloud-based synthetic data platforms are further contributing to market growth. North America is expected to hold the largest market share during the forecast period due to the early adoption of AI and ML technologies and the presence of key vendors in the region. Europe is anticipated to witness significant growth due to increasing government initiatives to promote AI adoption and the growing data privacy concerns. The Asia Pacific region is projected to experience rapid growth due to government initiatives to develop AI capabilities and the increasing adoption of AI and ML technologies in various industries, namely healthcare, retail, and manufacturing.

  13. IIITDMJ_Maize

    • ieee-dataport.org
    Updated Dec 20, 2023
    Cite
    Poornima Singh Thakur (2023). IIITDMJ_Maize [Dataset]. http://doi.org/10.21227/jrw1-md38
    Explore at:
    Dataset updated
    Dec 20, 2023
    Dataset provided by
    IEEE Dataport
    Authors
    Poornima Singh Thakur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The existing datasets lack the diversity required to train the model so that it performs equally well in real fields under varying environmental conditions. To address this limitation, we propose to collect a small number of in-field data and use the GAN to generate synthetic data for training the deep learning network. To demonstrate the proposed method, a maize dataset 'IIITDMJ_Maize' was collected using a drone camera under different weather conditions, including both sunny and cloudy days. The recorded video was processed to sample image frames that were later resized to 224 x 224. Keeping some raw images intact for evaluation purpose, images were further processed to crop only the portion containing diseases and selecting healthy plant images. With the help of agriculture experts, the raw and cropped images were subsequently categorized into four distinct classes -- (a) common rust, (b) northern leaf blight, (c) gray leaf spot, and (d) healthy. In total, 416 images were collected and labeled. Further, 50 raw (un-cropped) images of each category were also selected for testing the model's performance.

  14. Data from: Generation of synthetic whole-slide image tiles of tumours from...

    • search-dev.test.dataone.org
    • search.dataone.org
    • +2more
    Updated Apr 12, 2024
    Cite
    Francisco Carrillo-Perez; Marija Pizurica; Yuanning Zheng; Tarak Nath Nandi; Ravi Madduri; Jeanne Shen; Olivier Gevaert (2024). Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models [Dataset]. http://doi.org/10.5061/dryad.6djh9w174
    Explore at:
    Dataset updated
    Apr 12, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Francisco Carrillo-Perez; Marija Pizurica; Yuanning Zheng; Tarak Nath Nandi; Ravi Madduri; Jeanne Shen; Olivier Gevaert
    Time period covered
    Jan 1, 2023
    Description

    Data scarcity presents a significant obstacle in the field of biomedicine, where acquiring diverse and sufficient datasets can be costly and challenging. Synthetic data generation offers a potential solution to this problem by expanding dataset sizes, thereby enabling the training of more robust and generalizable machine learning models. Although previous studies have explored synthetic data generation for cancer diagnosis, they have predominantly focused on single-modality settings, such as whole-slide image tiles or RNA-Seq data. To bridge this gap, we propose a novel approach, RNA-Cascaded-Diffusion-Model or RNA-CDM, for performing RNA-to-image synthesis in a multi-cancer context, drawing inspiration from successful text-to-image synthesis models used in natural images. In our approach, we employ a variational auto-encoder to reduce the dimensionality of a patient’s gene expression profile, effectively distinguishing between different types of cancer. Subsequently, we employ a cascad...

    RNA-CDM Generated One Million Synthetic Images

    https://doi.org/10.5061/dryad.6djh9w174

    One million synthetic digital pathology images were generated using the RNA-CDM model presented in the paper "RNA-to-image multi-cancer synthesis using cascaded diffusion models".

    Description of the data and file structure

    There are ten different h5 files per cancer type (TCGA-CESC, TCGA-COAD, TCGA-KIRP, TCGA-GBM, TCGA-LUAD). Each h5 file contains 20,000 images. The key is the tile number, ranging from 0-20,000 in the first file, and from 180,000-200,000 in the last file. The tiles are saved as numpy arrays.
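
    The file layout described above implies a simple mapping from tile number to file. The sketch below is illustrative only and assumes that the key ranges are half-open (tiles 0 to 19,999 in the first file) and that files are counted from zero; neither detail is stated in the description, and the function name is hypothetical.

```python
TILES_PER_FILE = 20_000
FILES_PER_CANCER_TYPE = 10
CANCER_TYPES = ["TCGA-CESC", "TCGA-COAD", "TCGA-KIRP", "TCGA-GBM", "TCGA-LUAD"]


def file_index_for_tile(tile_number):
    """Zero-based index of the h5 file expected to hold a tile.

    Assumes half-open key ranges: file 0 holds tiles 0..19,999,
    file 9 holds tiles 180,000..199,999.
    """
    upper_bound = TILES_PER_FILE * FILES_PER_CANCER_TYPE
    if not 0 <= tile_number < upper_bound:
        raise ValueError(f"tile number must be in [0, {upper_bound})")
    return tile_number // TILES_PER_FILE


# 5 cancer types x 10 files x 20,000 tiles per file.
total_images = len(CANCER_TYPES) * FILES_PER_CANCER_TYPE * TILES_PER_FILE
print(f"Total images: {total_images:,}")
print("Tile 185,000 is in file index", file_index_for_tile(185_000))
```

    The product of the three counts matches the one million images stated in the dataset title.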

    Code/Software

    The code used to generate this data is available under academic license at https://rna-cdm.stanford.edu.

    Manuscript citation

    Carrillo-Perez, F., Pizurica, M., Zheng, Y. et al. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models...

  15. Synthea synthetic patient data for lung cancer risk prediction machine...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Chen, AJ (2023). Synthea synthetic patient data for lung cancer risk prediction machine learning [Dataset]. http://doi.org/10.7910/DVN/GD5XWE
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Chen, AJ
    Description

    This dataset contains Synthea synthetic patient data used in building ML models for lung cancer risk prediction. The ML models are used to simulate ML-enabled LHS. This open dataset is part of the synthetic data repository of the Open LHS project on GitHub: https://github.com/lhs-open/synthetic-data. For data source and methods, see the first ML-LHS simulation paper published in Nature Scientific Reports: https://www.nature.com/articles/s41598-022-23011-4.

  16. Difference in equal opportunity (DEO) of models trained with synthetic data....

    • figshare.com
    • plos.figshare.com
    xls
    Updated Feb 5, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mayana Pereira; Meghana Kshirsagar; Sumit Mukherjee; Rahul Dodhia; Juan Lavista Ferres; Rafael de Sousa (2024). Difference in equal opportunity (DEO) of models trained with synthetic data. [Dataset]. http://doi.org/10.1371/journal.pone.0297271.t005
    Available download formats: xls
    Dataset updated
    Feb 5, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Mayana Pereira; Meghana Kshirsagar; Sumit Mukherjee; Rahul Dodhia; Juan Lavista Ferres; Rafael de Sousa
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We measure the DEO of models using real test data, DEO(R), and synthetic test data, DEO(S). DEO delta quantifies the difference between DEO(R) and DEO(S). All synthetic data were generated using privacy-loss parameter ϵ = 5.0.
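For readers unfamiliar with the metric, here is a minimal sketch of how a DEO delta could be computed. It assumes the common definition of difference in equal opportunity as the absolute gap in true positive rate between two groups, and it assumes the delta is taken as DEO(R) minus DEO(S). The description above does not spell out either formula, and the function names are illustrative only.

```python
def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives (label 1) that were predicted positive."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_on_positives:
        raise ValueError("no positive examples in this group")
    return sum(preds_on_positives) / len(preds_on_positives)


def deo(y_true, y_pred, group):
    """Absolute true-positive-rate gap between group 0 and group 1 (assumed definition)."""
    rates = []
    for g in (0, 1):
        yt = [t for t, gg in zip(y_true, group) if gg == g]
        yp = [p for p, gg in zip(y_pred, group) if gg == g]
        rates.append(true_positive_rate(yt, yp))
    return abs(rates[0] - rates[1])


def deo_delta(deo_real, deo_synth):
    """Difference between DEO on real and on synthetic test data (sign is an assumption)."""
    return deo_real - deo_synth
```

A delta near zero would indicate that evaluating on synthetic test data gives roughly the same fairness estimate as evaluating on real test data.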

  17. Bioinformatics Simulated

    • kaggle.com
    Updated Jan 7, 2025
    + more versions
    willian oliveira gibin (2025). Bioinformatics Simulated [Dataset]. http://doi.org/10.34740/kaggle/dsv/10398445
    Dataset updated
    Jan 7, 2025
    Dataset provided by
    Kaggle
    Authors
    willian oliveira gibin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This synthetic dataset was created to explore and develop machine learning models in bioinformatics. It contains 20,000 synthetic proteins, each with an amino acid sequence, calculated physicochemical properties, and a functional classification.

    The dataset includes the following columns: ID_Protein, a unique identifier for each protein; Sequence, a string of amino acids; Molecular_Weight, molecular weight calculated from the sequence; Isoelectric_Point, estimated isoelectric point based on the sequence composition; Hydrophobicity, average hydrophobicity calculated from the sequence; Total_Charge, sum of the charges of the amino acids in the sequence; Polar_Proportion, percentage of polar amino acids in the sequence; Nonpolar_Proportion, percentage of nonpolar amino acids in the sequence; Sequence_Length, total number of amino acids in the sequence; and Class, the functional class of the protein, one of five categories: Enzyme, Transport, Structural, Receptor, Other.

    While this is a simulated dataset, it was inspired by patterns observed in real protein datasets such as UniProt, a comprehensive database of protein sequences and annotations; the Kyte-Doolittle Scale, calculations of hydrophobicity; and Biopython, a tool for analyzing biological sequences. This dataset is ideal for training classification models for proteins, exploratory analysis of physicochemical properties of proteins, and building machine learning pipelines in bioinformatics.

    The dataset was created through sequence generation, where amino acid chains were randomly generated with lengths between 50 and 300 residues, property calculation using the Biopython library, and class assignment with classes randomly assigned for classification purposes. However, the sequences and properties do not represent real proteins but follow patterns observed in natural proteins, and the functional classes are simulated and do not correspond to actual biological characteristics.

    The dataset is divided into two subsets: Training, which includes 16,000 samples (proteinas_train.csv), and Testing, which includes 4,000 samples (proteinas_test.csv). This dataset was inspired by real bioinformatics challenges and designed to help researchers and developers explore machine learning applications in protein analysis.
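The generation procedure described above (random sequences of 50 to 300 residues, a computed property, and a randomly assigned class) can be sketched roughly as follows. This is not the author's code: the dataset description says Biopython was used for property calculation, whereas this sketch uses only the standard library and computes a single property, mean Kyte-Doolittle hydropathy. The function name and the uniform sampling of residues are assumptions.

```python
import random

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)
CLASSES = ["Enzyme", "Transport", "Structural", "Receptor", "Other"]


def make_protein(rng, min_len=50, max_len=300):
    """Build one synthetic record with a random sequence and a random class."""
    length = rng.randint(min_len, max_len)  # inclusive on both ends
    sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    hydrophobicity = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length
    return {
        "Sequence": sequence,
        "Sequence_Length": length,
        "Hydrophobicity": hydrophobicity,
        "Class": rng.choice(CLASSES),  # assigned at random, carries no biological signal
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
records = [make_protein(rng) for _ in range(5)]
```

Because the class is drawn independently of the sequence, a classifier trained on such records should not be expected to beat chance, which is consistent with the caveat in the description that the functional classes are simulated.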

  18. D

    TiCaM: Synthetic Images Dataset

    • datasetninja.com
    Updated May 23, 2021
    + more versions
    Jigyasa Katrolia; Jason Raphael Rambach; Bruno Mirbach (2021). TiCaM: Synthetic Images Dataset [Dataset]. https://datasetninja.com/ticam-synthetic-images
    Dataset updated
    May 23, 2021
    Dataset provided by
    Dataset Ninja
    Authors
    Jigyasa Katrolia; Jason Raphael Rambach; Bruno Mirbach
    License

    https://spdx.org/licenses/

    Description

    TiCaM Synthetic Images: A Time-of-Flight In-Car Cabin Monitoring Dataset is a time-of-flight dataset of car in-cabin images that provides a means to test extensive car cabin monitoring systems based on deep learning methods. The authors provide a synthetic image dataset of car cabin images similar to the real dataset, leveraging advanced simulation software’s capability to generate abundant data with little effort. This can be used to test domain adaptation between synthetic and real data for select classes. For both datasets the authors provide ground truth annotations for 2D and 3D object detection, as well as for instance segmentation.

  19. f

    Survey data file with analysis.

    • figshare.com
    xlsx
    Updated Nov 20, 2024
    + more versions
    Jennifer K. Wagner; Laura Y. Cabrera; Sara Gerke; Daniel Susser (2024). Survey data file with analysis. [Dataset]. http://doi.org/10.1371/journal.pdig.0000666.s002
    Available download formats: xlsx
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Jennifer K. Wagner; Laura Y. Cabrera; Sara Gerke; Daniel Susser
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artificial intelligence (AI) and machine learning (ML) tools are now proliferating in biomedical contexts, and there is no sign this will slow down any time soon. AI/ML and related technologies promise to improve scientific understanding of health and disease and have the potential to spur the development of innovative and effective diagnostics, treatments, cures, and medical technologies. Concerns about AI/ML are prominent, but two specific aspects of AI/ML have so far received little research attention: synthetic data and computational checklists that might promote not only the reproducibility of AI/ML tools but also increased attention to ethical, legal, and social implications (ELSI) of AI/ML tools. We administered a targeted survey to explore these two items among biomedical professionals in the United States. Our survey findings suggest that there is a gap in familiarity with both synthetic data and computational checklists among AI/ML users and developers and those in ethics-related positions who might be tasked with ensuring the proper use or oversight of AI/ML tools. The findings from this survey study underscore the need for additional ELSI research on synthetic data and computational checklists to inform escalating efforts, including the establishment of laws and policies, to ensure safe, effective, and ethical use of AI in health settings.

  20. U

    U.S. AI Training Dataset Market Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Dec 11, 2024
    Archive Market Research (2024). U.S. AI Training Dataset Market Report [Dataset]. https://www.archivemarketresearch.com/reports/us-ai-training-dataset-market-4957
    Available download formats: doc, ppt, pdf
    Dataset updated
    Dec 11, 2024
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    United States
    Variables measured
    Market Size
    Description

    The U.S. AI Training Dataset Market size was valued at USD 590.4 million in 2023 and is projected to reach USD 1880.70 million by 2032, exhibiting a CAGR of 18.0% during the forecast period. The U.S. AI training dataset market covers the generation, selection, and organization of datasets used to train artificial intelligence. These datasets contain the information that machine learning algorithms need to infer and learn from. Uses include advancing and improving AI solutions in business fields such as transport, medical analysis, language computing, and financial measurement. Applications include training models for tasks such as image classification, predictive modeling, and natural language interfaces. Emerging trends include a shift toward larger volumes of higher-quality, more varied, and annotated data to improve model efficiency, synthetic data generation to address data shortages, and attention to data confidentiality and ethical issues in dataset management. Furthermore, advances in artificial intelligence and machine learning are driving noticeable growth in the building and use of datasets. Recent developments include: In February 2024, Google struck a deal worth USD 60 million per year with Reddit that will give the former real-time access to the latter’s data and use Google AI to enhance Reddit’s search capabilities. In February 2024, Microsoft announced an investment of around USD 2.1 billion in Mistral AI to expedite the growth and deployment of large language models. The U.S. giant is expected to underpin Mistral AI with Azure AI supercomputing infrastructure to provide top-notch scale and performance for AI training and inference workloads.
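For readers who want to relate the endpoint values to a growth rate, the sketch below shows the standard compound annual growth rate formula and its inverse. The function names are illustrative, and the number of compounding periods is an assumption: the description does not state which base year and span the quoted rate refers to, so a rate derived from the two endpoint values need not match it.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: float) -> float:
    """Value after compounding `rate` annually for `years` periods."""
    return start_value * (1 + rate) ** years


# Example call: treats 2023 to 2032 as nine compounding periods (an assumption).
implied_rate = cagr(590.4, 1880.70, 9)
```

A different base year or period count gives a different implied rate, so the period convention should be checked against the full report before comparing figures.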

Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002

Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx

Available download formats: xlsx
Dataset updated
Feb 5, 2025
Dataset provided by
Frontiers
Authors
Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
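The fidelity checks named in the Methods (a two-sample t-test and 95% CI overlap for a continuous parameter) can be sketched from summary statistics as follows. This is an illustration, not the study's code: it assumes Welch's unequal-variance form of the t statistic and a normal-approximation interval with z = 1.96, neither of which the abstract specifies, and all function names are hypothetical.

```python
import math


def mean_ci(mean, sd, n, z=1.96):
    """Approximate 95% CI for a mean (normal approximation, an assumption)."""
    half_width = z * sd / math.sqrt(n)
    return (mean - half_width, mean + half_width)


def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's two-sample t statistic computed from summary statistics."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error
```

In use, one would compute `mean_ci` for the real and the synthetic column of a parameter, check `intervals_overlap` on the two intervals, and compare `welch_t` against a critical value for the chosen significance level.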
