Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Feature preparation
Preprocessing was applied to the data, such as creating dummy variables and performing transformations (centering, scaling, Yeo-Johnson) using the preProcess() function from the “caret” package in R. The correlation among the variables was examined and no serious multicollinearity problems were found. A stepwise variable selection was performed using a logistic regression model. The final set of variables included:
Demographic: age, body mass index, sex, ethnicity, smoking
History of disease: heart disease, migraine, insomnia, gastrointestinal disease
COVID-19 history: COVID vaccination, rashes, conjunctivitis, shortness of breath, chest pain, cough, runny nose, dysgeusia, muscle and joint pain, fatigue, fever, COVID-19 reinfection, and ICU admission.
These variables were used to train and test various machine-learning models.
Model selection and training
The data was randomly split into 80% training and 20% testing subsets. The “h2o” package in R (version 4.3.1) was employed to implement different algorithms. AutoML was first used, which automatically explored a range of models with different configurations. Gradient Boosting Machines (GBM), Random Forest (RF), and Regularized Generalized Linear Model (GLM) were identified as the best-performing models on our data, and their parameters were fine-tuned. An ensemble method that stacked different models together was also used, as it can sometimes improve accuracy. The models were evaluated using the area under the curve (AUC) and C-statistics as diagnostic measures. The model with the highest AUC was selected for further analysis using the confusion matrix, accuracy, sensitivity, specificity, and F1 and F2 scores. The optimal prediction threshold was determined by plotting the sensitivity, specificity, and accuracy and choosing their point of intersection, as this balanced the trade-off between the three metrics. The model’s predictions were also plotted, and the quartile ranges of the predictions were used to classify them into four groups (very low, low, moderate, and high).
Metric: Formula
C-statistics: (TPR + TNR - 1) / 2
Sensitivity/Recall: TP / (TP + FN)
Specificity: TN / (TN + FP)
Accuracy: (TP + TN) / (TP + TN + FP + FN)
F1 score: 2 * (precision * recall) / (precision + recall)
Model interpretation
We used the variable importance plot, which measures how much each variable contributes to the predictive power of a machine learning model. In the h2o package, variable importance for GBM and RF is calculated by measuring the decrease in the model's error when a variable is split on. The more a variable's splits decrease the error, the more important that variable is considered to be. The error is calculated as SE = MSE * N = VAR * N, and then it is scaled between 0 and 1 and plotted. We also used the SHAP summary plot, a graphical tool to visualize the impact of input features on the prediction of a machine learning model. SHAP stands for SHapley Additive exPlanations, a method that calculates the contribution of each feature to the prediction by averaging over all possible subsets of features [28]. The SHAP summary plot shows the distribution of the SHAP values for each feature across the data instances. We use the h2o.shap_summary_plot() function in R to generate the SHAP summary plot for our GBM model. We pass the model object and the test data as arguments, and optionally specify the columns (features) we want to include in the plot.
The plot shows the SHAP values for each feature on the x-axis, and the features on the y-axis. The color indicates whether the feature value is low (blue) or high (red). The plot also shows the distribution of the feature values as a density plot on the right.
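As an illustration of the workflow described above, a minimal R sketch might look as follows; this is not the authors' script, and the data frame name df and the outcome column outcome are hypothetical.
# Minimal sketch: caret preprocessing plus an h2o AutoML run, assuming a data
# frame `df` with a binary outcome column `outcome` (hypothetical names).
library(caret)
library(h2o)
# Centering, scaling, and Yeo-Johnson transformation of the numeric predictors
pp <- preProcess(df[, setdiff(names(df), "outcome")],
                 method = c("center", "scale", "YeoJohnson"))
df_pp <- predict(pp, df)
# 80/20 train/test split
set.seed(42)
idx   <- createDataPartition(df_pp$outcome, p = 0.8, list = FALSE)
train <- df_pp[idx, ]
test  <- df_pp[-idx, ]
h2o.init()
train_h2o <- as.h2o(train)
test_h2o  <- as.h2o(test)
train_h2o$outcome <- as.factor(train_h2o$outcome)
test_h2o$outcome  <- as.factor(test_h2o$outcome)
# AutoML explores GBM, RF, GLM, and stacked ensembles; models are ranked by AUC
aml  <- h2o.automl(y = "outcome", training_frame = train_h2o,
                   max_models = 20, sort_metric = "AUC", seed = 42)
best <- aml@leader
h2o.auc(h2o.performance(best, newdata = test_h2o))
# SHAP summary plot for the leading model on the test data
h2o.shap_summary_plot(best, test_h2o)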
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Environmental data may be “large” due to number of records, number of covariates, or both. Random forests has a reputation for good predictive performance when using many covariates with nonlinear relationships, whereas spatial regression, when using reduced rank methods, has a reputation for good predictive performance when using many records that are spatially autocorrelated. In this study, we compare these two techniques using a data set containing the macroinvertebrate multimetric index (MMI) at 1859 stream sites with over 200 landscape covariates. A primary application is mapping MMI predictions and prediction errors at 1.1 million perennial stream reaches across the conterminous United States. For the spatial regression model, we develop a novel transformation procedure that estimates Box-Cox transformations to linearize covariate relationships and handles possibly zero-inflated covariates. We find that the spatial regression model with transformations, and a subsequent selection of significant covariates, has cross-validation performance comparable to random forests. We also find that prediction interval coverage is close to nominal for each method, but that spatial regression prediction intervals tend to be narrower and have less variability than quantile regression forest prediction intervals. A simulation study is used to generalize results and clarify advantages of each modeling approach.
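As a rough illustration of the transformation step only (not the paper's zero-inflation-aware procedure), a single covariate could be Box-Cox transformed and used in a linear fit alongside a random forest benchmark along these lines; the object names covariate and mmi are hypothetical.
# Illustrative sketch: estimate a Box-Cox exponent for one covariate and compare
# a linearized regression with a random forest. Not the paper's method.
library(car)            # powerTransform(), bcPower()
library(randomForest)
shift  <- if (min(covariate) <= 0) 1e-6 - min(covariate) else 0   # crude shift for zeros
x_pos  <- covariate + shift
lambda <- coef(powerTransform(x_pos))      # maximum-likelihood Box-Cox exponent
x_bc   <- bcPower(x_pos, lambda)           # transformed covariate
lm_fit <- lm(mmi ~ x_bc)                                           # linearized regression
rf_fit <- randomForest(x = data.frame(covariate = covariate), y = mmi)  # nonlinear benchmark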
This file contains the Fourier Transform Infrared Spectroscopy (FTIR) Spectroscopy Data from NOAA R/V Ronald H. Brown ship during VOCALS-REx 2008.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an archive of the data contained in the "Transformations" section in PubChem for integration into patRoon and other workflows.
For further details see the ECI GitLab site: README and main "tps" folder.
Credits:
Concepts: E Schymanski, E Bolton, J Zhang, T Cheng;
Code (in R): E Schymanski, R Helmus, P Thiessen
Transformations: E Schymanski, J Zhang, T Cheng and many contributors to various lists!
PubChem infrastructure: PubChem team
Reaction InChI (RInChI) calculations (v1.0): Gerd Blanke (previous versions of these files)
Acknowledgements: ECI team who contributed to related efforts, especially: J. Krier, A. Lai, M. Narayanan, T. Kondic, P. Chirsir, E. Palm. All contributors to the NORMAN-SLE transformations!
March 2025 released as v0.2.0 since the dataset grew by >3000 entries! The stats are:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sustainable land system transformations are necessary to avert biodiversity and climate collapse. However, it remains unclear where entry points for transformations exist in complex land systems. Here, we conceptualize land systems along land-use trajectories, which allows us to identify and evaluate leverage points; i.e., entry points on the trajectory where targeted interventions have particular leverage to influence land-use decisions. We apply this framework in the biodiversity hotspot Madagascar. In the Northeast, smallholder agriculture results in a land-use trajectory originating in old-growth forests, spanning forest fragments, and reaching shifting hill rice cultivation and vanilla agroforests. Integrating interdisciplinary empirical data on seven taxa, five ecosystem services, and three measures of agricultural productivity, we assess trade-offs and co-benefits of land-use decisions at three leverage points along the trajectory. These trade-offs and co-benefits differ between leverage points: two leverage points are situated at the conversion of old-growth forests and forest fragments to shifting cultivation and agroforestry, resulting in considerable trade-offs, especially between endemic biodiversity and agricultural productivity. Here, interventions enabling smallholders to conserve forests are necessary. This is urgent since ongoing forest loss threatens to eliminate these leverage points due to path-dependency. The third leverage point allows for the restoration of land under shifting cultivation through vanilla agroforests and offers co-benefits between restoration goals and agricultural productivity. The co-occurring leverage points highlight that conservation and restoration are simultaneously necessary. Methodologically, the framework shows how leverage points can be identified, evaluated, and harnessed for land system transformations under the consideration of path-dependency along trajectories.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset collects a raw dataset and a processed dataset derived from the raw dataset. There is a document containing the analytical code for statistical analysis of the processed dataset in .Rmd format and .html format.
The study examined some aspects of mechanical performance of solid wood composites. We were interested in certain properties of solid wood composites made using different adhesives with different grain orientations at the bondline, then treated at different temperatures prior to testing.
Performance was tested by assessing fracture energy and critical fracture energy, lap shear strength, and compression strength of the composites. This document concerns only the fracture properties, which are the focus of the related paper.
Notes:
* the raw data is provided in this upload, but the processing is not addressed here.
* the authors of this document are a subset of the authors of the related paper.
* this document and the related data files were uploaded at the time of submission for review. An update providing the doi of the related paper will be provided when it is available.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
To get the consumption model from Section 3.1, one needs to execute the file consumption_data.R: load the data for the 3 phases (./data/CONSUMPTION/PL1.csv, PL2.csv, PL3.csv), transform the data, and build the model (starting at line 225). The final consumption data can be found in one file for each year in ./data/CONSUMPTION/MEGA_CONS_list.Rdata.
To get the results for the optimization problem, one needs to execute the file analyze_data.R. It provides the functions to compare production and consumption data and to optimize for the different values (PV, MBC, ...).
To reproduce the figures, one needs to execute the file visualize_results.R. It provides the functions to reproduce the figures.
To calculate the solar radiation that is needed in the Production Data section, follow the file calculate_total_radiation.R.
To reproduce the radiation data from ERA5, which can be found in data.zip, do the following steps: 1. ERA5 - download the reanalysis datasets as GRIB files. For FDIR select "Total sky direct solar radiation at surface", for GHI select "Surface solar radiation downwards", and for ALBEDO select "Forecast albedo". 2. Convert GRIB to csv with the file era5toGRID.sh. 3. Convert the csv file to the data that is used in this paper with the file convert_year_to_grid.R.
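A rough sketch of how these steps might be chained in an R session; the file and folder names are those listed above, while the sourcing order is an assumption.
source("consumption_data.R")                      # loads PL1.csv, PL2.csv, PL3.csv, transforms them, builds the model (from line 225)
load("./data/CONSUMPTION/MEGA_CONS_list.Rdata")   # final consumption data, one file per year
source("analyze_data.R")                          # functions to compare production and consumption and to optimize PV, MBC, ...
source("visualize_results.R")                     # functions to reproduce the figures
source("calculate_total_radiation.R")             # solar radiation used in the Production Data section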
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List
glmmeg.R: R code demonstrating how to fit a logistic regression model, with a random intercept term, to randomly generated overdispersed binomial data.
boot.glmm.R: R code for estimating P-values by applying the bootstrap to a GLMM likelihood ratio statistic.
Description
glmmeg.R is example R code which shows how to fit a logistic regression model (with or without a random effects term) and use diagnostic plots to check the fit. The code is run on some randomly generated data, which are generated in such a way that overdispersion is evident. This code could be directly applied to your own analyses if you read into R a data.frame called “dataset”, which has columns labelled “success” and “failure” (for the number of binomial successes and failures), “species” (a label for the different rows in the dataset), and where we want to test for the effect of some predictor variable called “location”. In other cases, just change the labels and formula as appropriate. boot.glmm.R extends glmmeg.R by using bootstrapping to calculate P-values in a way that provides better control of Type I error in small samples. It accepts data in the same form as that generated in glmmeg.R.
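A minimal sketch of the kind of model glmmeg.R describes, assuming a data.frame called dataset with the columns named above; it uses lme4 and is not the archived code itself.
# Logistic regression with a row-level random intercept to absorb overdispersion
library(lme4)
fit  <- glmer(cbind(success, failure) ~ location + (1 | species),
              family = binomial, data = dataset)
summary(fit)
# Fixed-effects-only fit for comparison, e.g. as the null model in a likelihood ratio test
fit0 <- glm(cbind(success, failure) ~ location, family = binomial, data = dataset)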
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this project we have reviewed existing methods used to homogenize data and developed several new methods for dealing with this diversity in survey questions on the same subject. The project is a spin-off from the World Database of Happiness, the main aim of which is to collate and make available research findings on the subjective enjoyment of life and to prepare these data for research synthesis. The first methods we discuss were proposed in the book ‘Happiness in Nations’ and were used at the inception of the World Database of Happiness. Some 10 years later a new method was introduced: the International Happiness Scale Interval Study (HSIS). Taking the HSIS as a basis, the Continuum Approach was developed. Then, building on this approach, we developed the Reference Distribution Method.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Functional diversity (FD) is an important component of biodiversity that quantifies the difference in functional traits between organisms. However, FD studies are often limited by the availability of trait data and FD indices are sensitive to data gaps. The distribution of species abundance and trait data, and its transformation, may further affect the accuracy of indices when data is incomplete. Using an existing approach, we simulated the effects of missing trait data by gradually removing data from a plant, an ant and a bird community dataset (12, 59, and 8 plots containing 62, 297 and 238 species respectively). We ranked plots by FD values calculated from full datasets and then from our increasingly incomplete datasets and compared the ranking between the original and virtually reduced datasets to assess the accuracy of FD indices when used on datasets with increasingly missing data. Finally, we tested the accuracy of FD indices with and without data transformation, and the effect of missing trait data per plot or per the whole pool of species. FD indices became less accurate as the amount of missing data increased, with the loss of accuracy depending on the index. But, where transformation improved the normality of the trait data, FD values from incomplete datasets were more accurate than before transformation. The distribution of data and its transformation are therefore as important as data completeness and can even mitigate the effect of missing data. Since the effect of missing trait values pool-wise or plot-wise depends on the data distribution, the method should be decided case by case. Data distribution and data transformation should be given more careful consideration when designing, analysing and interpreting FD studies, especially where trait data are missing. To this end, we provide the R package “traitor” to facilitate assessments of missing trait data.
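An illustrative sketch of this simulation idea (not the “traitor” package itself), assuming a numeric species-by-trait matrix traits and a plot-by-species abundance matrix abund (hypothetical names): trait values are removed at random, a functional diversity index is recomputed, and plot rankings are compared.
library(FD)
fd_full   <- fdisp(gowdis(traits), abund)$FDis   # FD index (functional dispersion) from complete data
rank_full <- rank(fd_full)
for (p in c(0.1, 0.3, 0.5)) {                    # fraction of trait values removed
  traits_miss <- traits
  drop <- sample(length(traits_miss), size = round(p * length(traits_miss)))
  traits_miss[drop] <- NA                        # Gower distance tolerates missing trait values
  fd_miss <- fdisp(gowdis(traits_miss), abund)$FDis
  cat(p, "missing -> Spearman rho:",
      cor(rank_full, rank(fd_miss), method = "spearman"), "\n")
}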
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The C2Metadata (“Continuous Capture of Metadata”) Project automates one of the most burdensome aspects of documenting the provenance of research data: describing data transformations performed by statistical software. Researchers in many fields use statistical software (SPSS, Stata, SAS, R, Python) for data transformation and data management as well as analysis. Scripts used with statistical software are translated into an independent Structured Data Transformation Language (SDTL), which serves as an intermediate language for describing data transformations. SDTL can be used to add variable-level provenance to data catalogs and codebooks and to create “variable lineages” for auditing software operations. This repository provides examples of scripts and metadata for use in testing C2Metadata tools.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sets used to prepare Figures 1-14 in the Journal of Non-Crystalline Solids X article entitled "Pressure induced structural transformations in amorphous MgSiO_3 and CaSiO_3." The files are labelled according to the figure numbers. The data sets were created using the methodology described in the manuscript. Each of the plots was drawn using QtGrace (https://sourceforge.net/projects/qtgrace/). The data set corresponding to a plotted curve within an QtGrace file can be identified by clicking on that curve. The units for each axis are identified on the plots.
Figure 1 shows the pressure-volume EOS at room temperature for amorphous and crystalline (a) MgSiO_3 and (b) CaSiO_3.
Figure 2 shows the pressure dependence of the neutron total structure factor S_{N}(k) for amorphous (a) MgSiO_3 and (b) CaSiO_3.
Figure 3 shows the pressure dependence of the neutron total pair-distribution function G_{N}(r) for amorphous (a) MgSiO_3 and (b) CaSiO_3.
Figure 4 shows the pressure dependence of several D′_{N}(r) functions for amorphous MgSiO_3 measured using the D4c diffractometer.
Figure 5 shows the pressure dependence of the Si-O coordination number in amorphous (a) MgSiO_3 and (b) CaSiO_3, the Si-O bond length in amorphous (c) MgSiO_3 and (d) CaSiO_3, and (e) the fraction of n-fold (n = 4, 5, or 6) coordinated Si atoms in these materials.
Figure 6 shows the pressure dependence of the M-O (a) coordination number and (b) bond length for amorphous MgSiO_3 and CaSiO_3.
Figure 7 shows the S_{N}(k) or S_{X}(k) functions for (a) MgSiO_3 and (b) CaSiO_3 after recovery from a pressure of 8.2 or 17.5 GPa.
Figure 8 shows the G_{N}(r) or G_{X}(r) functions for (a) MgSiO_3 and (b) CaSiO_3 after recovery from a pressure of 8.2 or 17.5 GPa.
Figure 9 shows the pressure dependence of the Q^n speciation for fourfold coordinated Si atoms in amorphous (a) MgSiO_3 and (b) CaSiO_3.
Figure 10 shows the pressure dependence in amorphous MgSiO_3 and CaSiO_3 of (a) the overall M-O coordination number and its contributions from M-BO and M-NBO connections, (b) the fractions of M-BO and M-NBO bonds, and (c) the associated M-BO and M-NBO bond distances.
Figure 11 shows the pressure dependence of the fraction of n-fold (n = 4, 5, 6, 7, 8, or 9) coordinated M atoms in amorphous (a) MgSiO_3 and (b) CaSiO_3.
Figure 12 shows the pressure dependence of the O-Si-O, Si-O-Si, Si-O-M, O-M-O and M-O-M bond angle distributions (M = Mg or Ca) for amorphous MgSiO_3 (left hand column) and CaSiO_3 (right hand column).
Figure 13 shows the pressure dependence of the q-parameter distributions for n-fold (n = 4, 5, or 6) coordinated Si atoms in amorphous (a) MgSiO_3 and (b) CaSiO_3.
Figure 14 shows the pressure dependence of the q-parameter distributions for the M atoms in amorphous MgSiO_3 (left hand column) and CaSiO_3 (right hand column).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fisheries management is generally based on age-structured models. Thus, fish ageing data are collected by experts who analyze and interpret calcified structures (scales, vertebrae, fin rays, otoliths, etc.) according to a visual process. The otolith, in the inner ear of the fish, is the most commonly used calcified structure because it is metabolically inert and historically one of the first proxies developed. It contains information throughout the whole life of the fish and provides age structure data for stock assessments of all commercial species. The traditional human reading method to determine age is very time-consuming. Automated image analysis can be a low-cost alternative method; however, the first step is the transformation of routinely taken otolith images into standardized images within a database, so that machine learning techniques can be applied to the ageing data. Otolith shape, resulting from the synthesis of genetic heritage and environmental effects, is a useful tool to identify stock units, therefore a database of standardized images could be used for this aim. Using the routinely measured otolith data of plaice (Pleuronectes platessa; Linnaeus, 1758) and striped red mullet (Mullus surmuletus; Linnaeus, 1758) in the eastern English Channel and North-East Arctic cod (Gadus morhua; Linnaeus, 1758), a greyscale image matrix was generated from the raw images in different formats. Contour detection was then applied to identify broken otoliths, the orientation of each otolith, and the number of otoliths per image. To finalize this standardization process, all images were resized and binarized. Several mathematical morphology tools were developed from these new images to align and orient the images, placing the otoliths in the same layout for each image. For this study, we used three databases from two different laboratories covering three species (cod, plaice and striped red mullet). The method was validated for these three species and could be applied to other species for age determination and stock identification.
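A rough sketch of the greyscale conversion, binarization, and resizing steps using the imager package; the file name and target size are placeholders, and the contour-detection and orientation steps of the study are not reproduced here.
library(imager)
img  <- load.image("otolith_example.png")          # placeholder file name
gray <- grayscale(img)                             # greyscale image matrix
bin  <- as.cimg(threshold(gray))                   # automatic binarization
std  <- resize(bin, size_x = 256, size_y = 256)    # common image size (placeholder)
plot(std)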
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source data for Brabham et al. (2024) bioRxiv that includes raw data, uncropped images, and scripts used for data analysis and figure preparation. https://doi.org/10.1101/2024.06.25.599845
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Inner Mongolia: Petroleum Product: Input & Output of Transformation: Petroleum Refinery data was reported at 3,956.800 Ton th in 2022. This records an increase from the previous number of 3,820.000 Ton th for 2021. Inner Mongolia: Petroleum Product: Input & Output of Transformation: Petroleum Refinery data is updated yearly, averaging 53.180 Ton th from Dec 1995 (Median) to 2022, with 28 observations. The data reached an all-time high of 3,956.800 Ton th in 2022 and a record low of -194.100 Ton th in 2008. Inner Mongolia: Petroleum Product: Input & Output of Transformation: Petroleum Refinery data remains active status in CEIC and is reported by National Bureau of Statistics. The data is categorized under China Premium Database’s Energy Sector – Table CN.RBJ: Petroleum Product Balance Sheet: Inner Mongolia.
The data contains inequality measures at the municipality level for 1892 and 1871, as estimated in the PhD thesis "Institutions, Inequality and Societal Transformations" by Sara Moricz. The data also contains the source publications: 1) table 1 from “Bidrag till Sverige official statistik R) Valstatistik. XI. Statistiska Centralbyråns underdåniga berättelse rörande kommunala rösträtten år 1892” (biSOS R 1892), and 2) table 1 from “Bidrag till Sverige official statistik R) Valstatistik. II. Statistiska Centralbyråns underdåniga berättelse rörande kommunala rösträtten år 1871” (biSOS R 1871).
A UTF-8 encoded .csv-file. Each row is a municipality of the agricultural sample (2222 in total). Each column is a variable.
R71muncipality_id: a unique identifier for the municipalities in the R1871 publication (the municipality name can be obtained from the source data)
R92muncipality_id: a unique identifier for the municipalities in the R1892 publication (the municipality name can be obtained from the source data)
agriTop1_1871: an ordinal measure (ranking) of the top 1 income share in the agricultural sector for 1871
agriTop1_1892: an ordinal measure (ranking) of the top 1 income share in the agricultural sector for 1892
highestFarm_1871: a cardinal measure of the top 1 person share in the agricultural sector for 1871
highestFarm_1892: a cardinal measure of the top 1 person share in the agricultural sector for 1892
A UTF-8 encoded .csv-file. Each row is a municipality of the industrial sample (1328 in total). Each column is a variable.
R71muncipality_id: see above description
R92muncipality_id: see above description
indTop1_1871: an ordinal measure (ranking) of the top 1 income share in the industrial sector for 1871
indTop1_1892: an ordinal measure (ranking) of the top 1 income share in the industrial sector for 1892
A UTF-8 encoded .csv-file with the source data. The variables are described in the adherent codebook moricz_R1892_source_data_codebook.csv.
Contains table 1 from “Bidrag till Sverige official statistik R) Valstatistik. XI. Statistiska Centralbyråns underdåniga berättelse rörande kommunala rösträtten år 1892” (biSOS R 1892). SCB provides the scanned publication on their website. Dollar Typing Service typed and delivered the data in 2015. All numerical variables but two have been checked. This is easy to do since nearly all columns should sum up to another column. For “Folkmangd” (population) the numbers have been corrected against U1892. The highest estimate of errors in the checked variables is 0.005 percent (0.5 promille), calculated at the cell level. The two numerical variables that have not been checked are “hogsta_fyrk_jo“ and “hogsta_fyrk_ov“, as these cannot easily be cross-checked internally in the data. According to my calculations, in the worst-case scenario these variables carry measurement errors of 0.0043 percent (0.43 promille).
A UTF-8 encoded .csv-file with the source data. The variables are described in the adherent codebook moricz_R1871_source_data_codebook.csv.
Contains table 1 from “Bidrag till Sverige official statistik R) Valstatistik. II. Statistiska Centralbyråns underdåniga berättelse rörande kommunala rösträtten år 1871” (biSOS R 1871). SCB provides the scanned publication on their website. Dollar Typing Service typed and delivered the data in 2015. The variables have been checked for accuracy, which is feasible since columns and rows should sum. The variables that most likely carry mistakes are “hogsta_fyrk_al” and “hogsta_fyrk_jo”.
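A tiny sketch of the kind of internal consistency check described above (component columns summing to a total column); the file and column names here are hypothetical.
src <- read.csv("moricz_R1892_source_data.csv", fileEncoding = "UTF-8")   # placeholder file name
mismatch <- abs(rowSums(src[, c("component_a", "component_b")]) - src$total) > 0
sum(mismatch)   # number of rows failing the check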
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research tested the effects of six chemotherapeutic drugs on the transformation frequency and growth rate of A. baylyi. Binomial models were constructed for each drug, with emmeans post hoc tests to infer the effect of drug concentration on A. baylyi behaviour. Significant differences suggested by the emmeans tests were annotated onto ggplot boxplots and assembled into one panel graph. A supplemental PDF is provided which displays the relevant statistical outputs from the data analysis.
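A minimal sketch of this type of analysis for a single drug, assuming a data frame dat with hypothetical columns transformed, total, and concentration; it is not the authors' code.
library(emmeans)
library(ggplot2)
dat$concentration <- factor(dat$concentration)
# Binomial model of transformation frequency against drug concentration
fit <- glm(cbind(transformed, total - transformed) ~ concentration,
           family = binomial, data = dat)
# Post hoc comparisons between concentrations via emmeans
emm <- emmeans(fit, ~ concentration)
pairs(emm)
# Boxplot of the raw frequencies, to be annotated with the emmeans significances
ggplot(dat, aes(concentration, transformed / total)) +
  geom_boxplot() +
  labs(x = "Drug concentration", y = "Transformation frequency")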
Prepare a prediction model for the profit of the 50_startups data. Apply transformations to obtain better predictions of profit, and make a table containing the R^2 value for each prepared model.
R&D Spend -- research and development spend in the past few years
Administration -- spend on administration in the past few years
Marketing Spend -- spend on marketing in the past few years
State -- the state from which the data is collected
Profit -- profit of each state in the past few years
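A brief sketch of the exercise, assuming the data are in a file named 50_Startups.csv (an assumption) with the columns listed above; the transformations shown are only examples.
startups <- read.csv("50_Startups.csv")
names(startups) <- c("rd_spend", "administration", "marketing", "state", "profit")
m1 <- lm(profit ~ rd_spend + administration + marketing + state, data = startups)
m2 <- lm(profit ~ log1p(rd_spend) + log1p(marketing) + administration + state, data = startups)
m3 <- lm(profit ~ poly(rd_spend, 2) + marketing + state, data = startups)
# Table of R^2 values for each prepared model
data.frame(model     = c("linear", "log-transformed", "quadratic RD spend"),
           r_squared = c(summary(m1)$r.squared,
                         summary(m2)$r.squared,
                         summary(m3)$r.squared))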