4 datasets found

Data Sheet 2_Large language models generating synthetic clinical datasets: a...
frontiersin.figshare.com
xlsx
Updated Feb 5, 2025
+ more versions
Cite
Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/frai.2025.1533508.s002
Dataset updated
Feb 5, 2025
Dataset provided by
Frontiers
Authors
Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.
Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.
Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
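Two of the fidelity checks named in the Methods above, the two-sample proportion test and the 95% CI overlap, can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, and the confidence interval uses a normal approximation, an assumption that is only reasonable for large samples such as the 6,166 case files. For the two-sample t-tests on continuous parameters, a library routine such as `scipy.stats.ttest_ind` would be the usual choice.

```python
import math
from statistics import NormalDist, mean, stdev


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided two-sample proportion z-test using the pooled standard error.

    x1, x2 are counts of cases with the attribute; n1, n2 are sample sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def mean_ci(sample, level=0.95):
    """Normal-approximation confidence interval for a sample mean (assumption)."""
    m = mean(sample)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * stdev(sample) / math.sqrt(len(sample))
    return m - half_width, m + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical usage: compare a binary parameter and a continuous one.
z, p = two_proportion_z_test(50, 100, 50, 100)
overlap = intervals_overlap(mean_ci([1, 2, 3, 4, 5]), mean_ci([2, 3, 4, 5, 6]))
```

Under this reading, a parameter would count as statistically similar when the test p-value is at or above the chosen significance threshold, and CI overlap offers a second, coarser check for continuous parameters.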
Dataset for: A Model for Antarctic Surface Mass Balance and Ice Core Site...
wiley.figshare.com
text/x-tex
Updated Jun 1, 2023
Cite
Philip Andrew White; C. Shane Reese; William F. Christensen; Summer Rupper (2023). Dataset for: A Model for Antarctic Surface Mass Balance and Ice Core Site Selection [Dataset]. http://doi.org/10.6084/m9.figshare.8044745.v1
Available download formats: text/x-tex
Unique identifier
https://doi.org/10.6084/m9.figshare.8044745.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Wiley
Authors
Philip Andrew White; C. Shane Reese; William F. Christensen; Summer Rupper
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Antarctica
Description
In this study, we develop a model for Antarctic surface mass balance (SMB) that allows us to assess regional and global uncertainty in SMB estimation and carry out a model-based design to propose new measurement sites. For this analysis, we use a quality-controlled aggregate dataset of SMB field measurements with significantly more observations than previous analyses; however, many of the measurements in this dataset lack quality ratings. In addition, these data demonstrate spatial autocorrelation, heteroscedasticity, and non-Gaussianity. To account for these data attributes, we pose a Bayesian Gaussian process generalized linear model for SMB. To address missing reliability ratings, we use a mixture model with different variances to add robustness to our model. In addition, we present a novel approach for modeling the variance as a function of the mean to account for the heteroscedasticity in the data. Using this model, we predict Antarctic SMB and compare our estimates with previous estimates. In addition, we create prediction maps with uncertainty to visualize spatial patterns in SMB and to identify regions of high SMB uncertainty. Our model estimates total SMB to be 2156 Gton/yr over the range of our data, with 95% credible interval (2081, 2234) Gton/yr. Overall, our results suggest lower Antarctic SMB than previously reported. This lower SMB estimate may be indicative of a more dire diagnosis of the long-term health of the Antarctic ice sheets. Lastly, we use our model to propose 25 new measurement sites for field study utilizing a sequential design minimizing integrated mean squared error.
Maps of the diversity and distribution of Raunkiær's life forms in European...
research.science.eus
ekoizpen-zientifikoa.ehu.eus
+1 more
Updated 2023
Cite
Maps of the diversity and distribution of Raunkiær's life forms in European vegetation [Dataset]. https://research.science.eus/documentos/67321dfdaea56d4af0485039
Dataset updated
2023
Authors
Midolo, Gabriele; Axmanová, Irena; Divíšek, Jan; Dřevojan, Pavel; Lososová, Zdeňka; Večeřa, Martin; Karger, Dirk Nikolaus; Thuiller, Wilfried; Bruelheide, Helge; Aćić, Svetlana; Attorre, Fabio; Biurrun, Idoia; Bonari, Gianmaria; Čarni, Andraž; Chiarucci, Alessandro; Ćušterevska, Renata; Dengler, Jürgen; Dziuba, Tetiana; Garbolino, Emmanuel; Lenoir, Jonathan; Marcenò, Corrado; Rūsiņa, Solvita; Šibík, Jozef; Škvorc, Željko; Stančić, Zvjezdana; Stanišić-Vujačić, Milica; Svenning, Jens-Christian; Swacha, Grzegorz; Vassilev, Kiril; Chytrý, Milan
Description
This repository contains raster files (TIF format) with a 50 km × 50 km resolution (over UTM grid EPSG:32633), showcasing the diversity and distribution of Raunkiær’s life forms in European vegetation. The maps are based on two key metrics: (i) the proportion (%) of species within each life form and (ii) the diversity of life forms, including richness and evenness.

To generate these maps, we averaged plot-level metric values across a comprehensive dataset comprising 546,501 vegetation plots sourced from the European Vegetation Archive (EVA; Project 163; https://euroveg.org). These plots cover diverse habitats, including 173,190 forests, 260,884 grasslands, 52,517 scrubs, and 59,910 wetlands.

The maps encompass the entire dataset, offering a visualization of the geographical distribution patterns of life forms across Europe. Additionally, we created habitat-specific maps by subsetting the dataset to explore unique patterns within each habitat type (forest, grassland, scrub, and wetland).

Furthermore, we generated additional maps based on standardised effect sizes (SES) of diversity metrics. Through 500 species identity shuffles without replacement, specific to each habitat type, we examined the deviations from random expectations. SES values outside the range of -1.96 to 1.96 indicate significantly lower or higher metric values than expected at random, respectively.
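The standardised effect size described above can be sketched as the observed metric minus the mean of the shuffled (null) values, divided by their standard deviation. The example below is a minimal sketch under stated assumptions, not the authors' workflow: the function names, the toy species pool, and the choice to permute life-form labels among species as the "species identity shuffle" are all hypothetical.

```python
import random
from statistics import mean, stdev


def standardized_effect_size(observed, null_values):
    """SES = (observed - mean of null values) / standard deviation of null values."""
    return (observed - mean(null_values)) / stdev(null_values)


def life_form_richness(plot_species, life_form_of):
    """Number of distinct life forms among the species recorded in one plot."""
    return len({life_form_of[s] for s in plot_species})


def ses_richness(plot_species, life_form_of, n_shuffles=500, seed=0):
    """SES of life-form richness for one plot under a label-permutation null.

    Assumption: the null is built by permuting life-form labels across the
    species pool; a permutation is a shuffle without replacement.
    """
    rng = random.Random(seed)
    species = list(life_form_of)
    forms = [life_form_of[s] for s in species]
    observed = life_form_richness(plot_species, life_form_of)
    null_values = []
    for _ in range(n_shuffles):
        shuffled = forms[:]
        rng.shuffle(shuffled)
        null_values.append(life_form_richness(plot_species, dict(zip(species, shuffled))))
    return standardized_effect_size(observed, null_values)


# Hypothetical species pool: two species per life form.
pool = {
    "sp_a": "therophyte", "sp_b": "therophyte",
    "sp_c": "geophyte", "sp_d": "geophyte",
    "sp_e": "phanerophyte", "sp_f": "phanerophyte",
}
ses = ses_richness(["sp_a", "sp_b"], pool)
```

In this toy case the plot holds two species of the same life form, so its richness is lower than most shuffles produce and the SES comes out negative. The sketch does not handle a null distribution with zero variance, which would need a guard before dividing.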

Folder name: Description of TIF raster values
full.div: Mean richness and evenness of life forms across all habitat types
full.mean.rel.prop: Mean proportion of each life form across all habitat types
habitat.div: Mean richness and evenness of life forms across separate habitat types (forest, grassland, scrub, and wetland)
habitat.mean.rel.prop: Mean proportion of each life form across separate habitat types (forest, grassland, scrub, and wetland)
SES.full.div: Mean richness and evenness of life forms across all habitat types measured with standardized effect sizes (SES)
SES.full.mean.rel.prop: Mean proportion of each life form across all habitat types measured with standardized effect sizes (SES)
SES.habitat.div: Mean richness and evenness of life forms across separate habitat types (forest, grassland, scrub, and wetland) measured with standardized effect sizes (SES)
SES.habitat.mean.rel.prop: Mean proportion of each life form across separate habitat types (forest, grassland, scrub, and wetland) measured with standardized effect sizes (SES)

Additional information is available in our publication: Midolo, G., Axmanová, I., Divíšek, J., Dřevojan, P., Lososová, Z., Večeřa, M., Karger, D. N., Thuiller, W., Bruelheide, H., Aćić, S., Attorre, F., Biurrun, I., Boch, S., Bonari, G., Čarni, A., Chiarucci, A., Ćušterevska, R., Dengler, J., Dziuba, T., Garbolino, E., Jandt, U., Lenoir, J., Marcenò, C., Rūsiņa, S., Šibík, J., Škvorc, Ž., Stančić, Z., Stanišić-Vujačić, M., Svenning, J. C., Swacha, G., Vassilev, K., & Chytrý, M. (2024) Diversity and distribution of Raunkiær’s life forms in European vegetation. Journal of Vegetation Science. Accepted on the 10th of December 2023
Balance sheet estimates (Economic trends, 1980 and 1981)
ons.gov.uk
cy.ons.gov.uk
xls
Updated Jan 12, 2016
Cite
Office for National Statistics (2016). Balance sheet estimates (Economic trends, 1980 and 1981) [Dataset]. https://www.ons.gov.uk/economy/nationalaccounts/uksectoraccounts/datasets/balancesheetestimateseconomictrends1980and1981
Available download formats: xls
Dataset updated
Jan 12, 2016
Dataset provided by
Office for National Statistics http://www.ons.gov.uk/
License
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Description
Annual balance sheet estimates 1957 to 1986, for a range of sectors, from Economic Trends and Financial Statistics.