Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: To develop and validate a machine learning (ML)-based model for predicting stroke-associated pneumonia (SAP) risk in older adult hemorrhagic stroke patients.
Methods: A retrospective cohort of older adult hemorrhagic stroke patients from three tertiary hospitals in Guiyang, Guizhou Province (January 2019–December 2022) formed the modeling cohort, randomly split into training and internal validation sets (7:3 ratio). External validation used retrospective data from January–December 2023. After univariate and multivariate regression analyses, four ML models (logistic regression, XGBoost, naive Bayes, and SVM) were constructed. Receiver operating characteristic (ROC) curves and the area under the curve (AUC) were calculated for the training and internal validation sets. Model performance was compared using DeLong's test or the bootstrap test, while sensitivity, specificity, accuracy, precision, recall, and F1-score evaluated predictive efficacy. Calibration curves assessed model calibration. The optimal model underwent external validation using ROC and calibration curves.
Results: A total of 788 older adult hemorrhagic stroke patients were enrolled, divided into a training set (n = 462), an internal validation set (n = 196), and an external validation set (n = 130). The incidence of SAP was 46.7% (368/788). Advanced age [OR = 1.064, 95% CI (1.024, 1.104)], smoking [OR = 2.488, 95% CI (1.460, 4.24)], low GCS score [OR = 0.675, 95% CI (0.553, 0.825)], low Braden score [OR = 0.741, 95% CI (0.640, 0.858)], and nasogastric tube use [OR = 1.761, 95% CI (1.048, 2.960)] were identified as risk factors for SAP. Among the four machine learning algorithms evaluated [XGBoost, logistic regression (LR), support vector machine (SVM), and naive Bayes], the LR model demonstrated robust and consistent performance in predicting SAP across multiple evaluation metrics.
Furthermore, the model exhibited stable generalizability in the external validation cohort. Based on these findings, the LR framework was selected for external validation, accompanied by a nomogram visualization. The model achieved AUC values of 0.883 (training), 0.855 (internal validation), and 0.882 (external validation). The Hosmer-Lemeshow (H-L) test indicated satisfactory calibration in all three datasets, with P-values of 0.381, 0.142, and 0.066, respectively.
Conclusions: This study constructed and validated a risk prediction model for SAP in older adult patients with hemorrhagic stroke based on multi-center data. Among the four machine learning algorithms (XGBoost, LR, SVM, and naive Bayes), the LR model demonstrated the best and most stable predictive performance. Age, smoking, low GCS score, low Braden score, and nasogastric tube use were identified as predictive factors for SAP in these patients. These indicators are easily obtainable in clinical practice and support rapid bedside assessment. Internal and external validation showed good generalization ability, and a nomogram was drawn to provide an objective, operational risk assessment tool for clinical nursing practice. It helps in the early identification of high-risk patients and guides targeted interventions, thereby reducing the incidence of SAP and improving patient prognosis.
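For illustration, the 7:3 split and logistic regression workflow described above can be sketched with scikit-learn. The data below are synthetic stand-ins (no study data are reproduced), and the feature columns mirror the reported predictors by name only:

```python
# Hedged sketch of the described pipeline: 7:3 split, logistic regression,
# AUC on the internal validation set. All data here are randomly generated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 788  # cohort size reported above
X = np.column_stack([
    rng.normal(75, 7, n),    # age (years)
    rng.integers(0, 2, n),   # smoking (0/1)
    rng.integers(3, 16, n),  # GCS score
    rng.integers(6, 24, n),  # Braden score
    rng.integers(0, 2, n),   # nasogastric tube (0/1)
])
# Synthetic outcome loosely tied to the features so the AUC is non-trivial
logit = 0.05 * X[:, 0] + 0.9 * X[:, 1] - 0.4 * X[:, 2] - 0.3 * X[:, 3] + 0.6 * X[:, 4]
y = (logit + rng.normal(0, 2, n) > np.median(logit)).astype(int)

# 7:3 split into training and internal validation sets, as in the study design
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"internal validation AUC: {auc:.3f}")
```

This is only a structural sketch; the study's reported coefficients and AUC values come from real patient data, not from anything resembling this simulation.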
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF).
Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in machine learning models built with TensorFlow, XGBoost, and SchNetPack to study their ability to predict docking scores. The first data set originally contained 60,411 in-vivo-labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active, or assumed to be active, at 10 μM or less in a direct binding assay. These sets were downloaded on 10 December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in the mol2 files were excluded as well (523 compounds in the in-vivo set and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected.
The provided in-vivo and the in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referencing study.
The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets of each in-vivo 80-5-15 split were created to monitor the convergence of the training process. These subsets were constructed such that each subset contains all compounds from the previous one (starting with the 10-5-15 subset) and is enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).
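The nested-subset construction described above can be sketched as follows. The compound IDs here are stand-ins; the real composition is recorded in in-vivo-splits-data.csv:

```python
# Illustrative sketch of the nested training subsets: each subset extends
# the previous one by one eighth of the full (80%) train set of a split.
# The IDs are synthetic stand-ins, not real ZINC15 compound identifiers.
import numpy as np

rng = np.random.default_rng(1)
train_ids = rng.permutation(1000)  # stand-in for one split's 80% train set

eighth = len(train_ids) // 8
subsets = {}
for k in range(1, 9):  # in_vivo_10 ... in_vivo_80
    subsets[f"in_vivo_{10 * k}"] = train_ids[: eighth * k]

# Each subset is a superset of the previous one, by construction
for k in range(2, 9):
    prev = set(subsets[f"in_vivo_{10 * (k - 1)}"])
    curr = set(subsets[f"in_vivo_{10 * k}"])
    assert prev <= curr
print(len(subsets["in_vivo_80"]))  # the full train set: 1000
```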
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a global resource for machine learning applications in mining area detection and semantic segmentation on satellite imagery. It contains Sentinel-2 satellite images and corresponding mining area masks + bounding boxes for 1,210 sites worldwide. Ground-truth masks are derived from Maus et al. (2022) and Tang et al. (2023), and validated through manual verification to ensure accurate alignment with Sentinel-2 imagery from specific timestamps.
The dataset includes three mask variants:
Each tile corresponds to a 2048x2048 pixel Sentinel-2 image, with metadata on mine type (surface, placer, underground, brine & evaporation) and scale (artisanal, industrial). For convenience, the preferred mask dataset is already split into training (75%), validation (15%), and test (10%) sets.
Furthermore, dataset quality was assessed by manually re-validating the test set tiles and correcting any mismatches between the mining polygons and the visually observed true mining areas in the images, resulting in the following estimated quality metrics:
| Metric (%) | Combined | Maus  | Tang  |
|------------|----------|-------|-------|
| Accuracy   | 99.78    | 99.74 | 99.83 |
| Precision  | 99.22    | 99.20 | 99.24 |
| Recall     | 95.71    | 96.34 | 95.10 |
Note that the dataset does not contain the Sentinel-2 images themselves but contains a reference to specific Sentinel-2 images. Thus, for any ML applications, the images must be persisted first. For example, Sentinel-2 imagery is available from Microsoft's Planetary Computer and filterable via STAC API: https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a. Additionally, the temporal specificity of the data allows integration with other imagery sources from the indicated timestamp, such as Landsat or other high-resolution imagery.
Source code used to generate this dataset and to use it for ML model training is available at https://github.com/SimonJasansky/mine-segmentation. It includes useful Python scripts, e.g. to download Sentinel-2 images via STAC API, or to divide tile images (2048x2048px) into smaller chips (e.g. 512x512px).
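The chip-splitting step mentioned above (2048x2048 tiles into 512x512 chips) can be sketched in plain NumPy. This is only an illustration of the idea; the repository's own scripts may implement it differently:

```python
# Minimal sketch: divide one 2048x2048 tile into non-overlapping 512x512
# chips. The tile here is a zero-filled stand-in for a Sentinel-2 band.
import numpy as np

tile = np.zeros((2048, 2048), dtype=np.uint8)  # stand-in image band
chip_size = 512

chips = [
    tile[r : r + chip_size, c : c + chip_size]
    for r in range(0, tile.shape[0], chip_size)
    for c in range(0, tile.shape[1], chip_size)
]
print(len(chips))  # 4 x 4 = 16 chips per tile
```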
A database schema, a schematic depiction of the dataset generation process, and a map of the global distribution of tiles are provided in the accompanying images.
This dataset is a slightly modified version of the NoReC dataset for document-level sentiment analysis. The data points remain unchanged, with the only adjustment being the compilation into a CSV file for ease of use. This straightforward approach ensures the dataset's simplicity and accessibility while preserving the authenticity of the original content.
Usage The dataset contains a split column which can be used to split the dataset into training, validation and test sets. However, feel free to split the dataset as you see fit.
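A minimal sketch of using the split column with pandas; the column name ("split") is stated above, but the value labels (train/validation/test) and other column names are assumptions about the released CSV:

```python
# Hedged sketch: partition the CSV by its split column. The tiny
# DataFrame below is a stand-in for the released NoReC CSV file.
import pandas as pd

df = pd.DataFrame({
    "text": ["great film", "bad book", "quite okay", "fantastic"],
    "split": ["train", "train", "validation", "test"],
})
train = df[df["split"] == "train"]
val = df[df["split"] == "validation"]
test = df[df["split"] == "test"]
print(len(train), len(val), len(test))  # 2 1 1
```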
CC-BY-NC
Original Data Source: Norwegian Review Corpus
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Cuff-Less Blood Pressure Estimation Dataset [2] comes from the UCI Machine Learning Repository. It is a subset of the MIMIC-II Waveform Dataset and contains 12,000 records of simultaneous PPG and ABP from 942 patients at a sampling rate of 125 Hz. The 12,000 records were uniformly split into four parts of 3,000 records each. However, as subject information is lacking, a hold-one-out strategy was used to generate the training, validation, and test sets once the data had been preprocessed. In the end, the UCI dataset comprised 291,078 segments, around 404 hours of recording, making it by far the largest data set, with a considerably higher ratio of continuous segments per record (32.15).
[2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.
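The segment counts above imply roughly 5-second windows (404 h / 291,078 segments ≈ 5 s). Purely as an illustration under that assumed window length, slicing one record at 125 Hz into non-overlapping segments could look like:

```python
# Hedged sketch: fixed-length segmentation of one waveform record.
# The 5-second window is an assumption for illustration only; the
# record below is a zero-filled stand-in for a real PPG trace.
import numpy as np

fs = 125  # Hz, as stated for the UCI records
window_s = 5  # assumed segment length
record = np.zeros(10 * fs)  # stand-in 10-second record

step = window_s * fs
segments = [record[i : i + step] for i in range(0, len(record) - step + 1, step)]
print(len(segments), len(segments[0]))  # 2 segments of 625 samples each
```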
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Figure 6. Extending sharp-wave ripple detection to non-human primates. c) Significant differences between SWRs recorded in mice and monkeys. d) The best model of each architecture trained on mouse data, and the best filter configuration for mouse data, were applied to detect SWRs in the macaque data. We evaluated all models by computing the F1-score against the ground truth (GT). Note the relatively good results from the non-retrained ML models and filter. e) Results of model re-training using macaque data. Data were split into training and validation sets (50% and 20%, respectively), used to train the ML models, and a test set (30%), used to compute the F1-score (left panel). The filter was not re-trained. f) F1-scores for the maximal performance of each model before and after re-training.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a collection of two datasets: one sourced from CPM data (bham_gcmx-4x_12em_psl-sphum4th-temp4th-vort4th_eqvt_random-season.tar.gz) and one sourced from GCM data (bham_60km-4x_12em_psl-sphum4th-temp4th-vort4th_eqvt_random-season.tar.gz). Each dataset is made up of climate model variables extracted from the Met Office's storage system, combining many variables over many years. It consists of three NetCDF files (train.nc, test.nc and val.nc), a YAML configuration file (ds-config.yml) and a README (similar to this one but tailored to the source of the data). Code used to create the dataset can be found here: https://github.com/henryaddison/mlde-data (specifically the v0.1.0 tag: https://github.com/henryaddison/mlde-data/tree/v0.1.0).
The YML file contains the configuration for the creation of the dataset, including the variables, scenario, ensemble members, spatial domain and resolution, and the scheme for splitting the data across the three subsets.
Each NetCDF file contains the same variables but split into different subsets (train, val and test) based on the time dimension. Otherwise, the NetCDF files have the same dimensions and coordinates for ensemble_member, grid_longitude and grid_latitude.
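The time-based partitioning behind train.nc, val.nc and test.nc can be sketched as below. The pandas timestamps stand in for the NetCDF time coordinate, and the cut-off dates are arbitrary illustrations; the actual split scheme is recorded in ds-config.yml:

```python
# Hedged sketch: same variables, partitioned along the time dimension.
# The years and split boundaries here are made up for illustration.
import pandas as pd

time = pd.to_datetime([f"{y}-01-01" for y in range(1981, 1991)])  # stand-in coordinate

train_mask = time < "1988-01-01"
val_mask = (time >= "1988-01-01") & (time < "1990-01-01")
test_mask = time >= "1990-01-01"
print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 7 2 1
```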
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: A clinical prediction model for postoperative acute kidney injury (AKI) in patients with type A acute aortic dissection (TAAAD) and type B acute aortic dissection (TBAAD) was constructed using machine learning (ML).
Methods: Baseline data were collected from acute aortic dissection (AAD) patients admitted to the First Affiliated Hospital of Xinjiang Medical University between January 1, 2019 and December 31, 2021. (1) We identified baseline serum creatinine (SCR) estimation methods and used them as the basis for the diagnosis of AKI. (2) The total dataset was randomly divided into a training set (70%) and a test set (30%); features were modeled and validated with bootstrap resampling using multiple ML methods in the training set, and the model with the largest area under the curve (AUC) was selected for follow-up studies. (3) The variables of the best ML model were screened using the model visualization tool Shapley Additive Explanations (SHAP) and recursive feature elimination (RFE). (4) Finally, the pre-screened prediction models were evaluated on the test set from three aspects: discrimination, calibration, and clinical benefit.
Results: The final incidence of AKI was 69.4% (120/173) in 173 patients with TAAAD and 28.6% (81/283) in 283 patients with TBAAD. For TAAAD-AKI, the random forest (RF) model showed the best prediction performance in the training set (AUC = 0.760, 95% CI: 0.630–0.881), while for TBAAD-AKI, the Light Gradient Boosting Machine (LightGBM) model worked best (AUC = 0.734, 95% CI: 0.623–0.847). Screening of the characteristic variables revealed that the predictors common to the two final models of postoperative AAD-AKI were baseline SCR, blood urea nitrogen (BUN) and uric acid (UA) at admission, and mechanical ventilation time (MVT).
The specific predictors in the TAAAD-AKI model were white blood cell count (WBC), platelet count (PLT) and D-dimer at admission. The specific predictors in the TBAAD-AKI model were plasma N-terminal pro B-type natriuretic peptide (BNP), serum potassium, activated partial thromboplastin time (APTT) and systolic blood pressure (SBP) at admission, and combined renal arteriography during surgery. In terms of discrimination, the area under the ROC curve of the RF model for TAAAD was 0.81 and that of the LightGBM model for TBAAD was 0.74 in the test set, both with good accuracy. In terms of calibration, the calibration curve of the TAAAD-AKI RF model fit the ideal curve best, with the smallest Brier score (0.16); similarly, the calibration curve of the TBAAD-AKI LightGBM model fit the ideal curve best, with the smallest Brier score (0.15). In terms of clinical benefit, the best ML models for both types of AAD showed good net benefit in decision curve analysis (DCA).
Conclusion: We successfully constructed and validated clinical prediction models for postoperative AKI in TAAAD and TBAAD patients using different ML algorithms. The main predictors of the two types of AAD-AKI differ somewhat, so the strategies for early prevention and control of AKI also differ; more external data are needed for validation.
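The evaluation pattern described above (AUC for discrimination, Brier score for calibration) can be sketched with scikit-learn on synthetic data. Nothing below reproduces the study's features or results; the random forest and the 70/30 split are the only elements taken from the text:

```python
# Hedged sketch: random forest with AUC (discrimination) and Brier score
# (calibration) on a held-out 30% test set. Data are randomly generated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(456, 6))  # stand-in features (e.g. SCR, BUN, UA, MVT, ...)
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, 456) > 0).astype(int)

# 70/30 train-test split, matching the study design
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

p = rf.predict_proba(X_te)[:, 1]
print(f"AUC={roc_auc_score(y_te, p):.2f}  Brier={brier_score_loss(y_te, p):.2f}")
```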
This dataset consists of 101 food categories, with 101'000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('food101', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/food101-2.0.0.png
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Pharmacogenetics currently supports clinical decision-making on the basis of a limited number of variants in a few genes and may benefit paediatric prescribing, where there is a need for more precise dosing. Integrating genomic information such as methylation into pharmacogenetic models holds the potential to improve their accuracy and, consequently, prescribing decisions. Cytochrome P450 2D6 (CYP2D6) is a highly polymorphic gene conventionally associated with the metabolism of commonly used drugs and endogenous substrates. We thus sought to predict epigenetic loci from single nucleotide polymorphisms (SNPs) related to CYP2D6 in children from the GUSTO cohort.
Methods: Buffy coat DNA methylation was quantified using the Illumina Infinium MethylationEPIC BeadChip. CpG sites associated with CYP2D6 were used as outcome variables in linear regression, Elastic Net and XGBoost models. We compared feature selection of SNPs from GWAS mQTLs, GTEx eQTLs and SNPs within 2 Mb of the CYP2D6 gene, and the impact of adding demographic data. The samples were split into training (75%) and test (25%) sets for validation. In the Elastic Net and XGBoost models, the optimal hyperparameter search was done using 10-fold cross-validation. Root mean square error and R-squared values were obtained to assess each model's performance. When GWAS was performed to determine SNPs associated with CpG sites, a total of 15 SNPs were identified, several of which appeared to influence multiple CpG sites.
Results: Overall, Elastic Net models of genetic features appeared to perform marginally better than heritability estimates and substantially better than linear regression and XGBoost models. The addition of non-genetic features appeared to improve performance for some but not all feature sets and probes. The best feature set and machine learning (ML) approach differed substantially between CpG sites, and a number of top variables were identified for each model.
Discussion: The development of SNP-based prediction models for CYP2D6 CpG methylation in Singaporean children of varying ethnicities in this study has clinical application. With further validation, they may add to the set of tools available to improve precision medicine and pharmacogenetics-based dosing.
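The modeling setup described above (Elastic Net predicting a continuous CpG methylation value from SNP dosages, 75/25 split, 10-fold CV for hyperparameters) can be sketched as follows. All data below are synthetic; no GUSTO data are reproduced, and the 15-SNP feature count is taken from the text:

```python
# Hedged sketch: ElasticNetCV on synthetic SNP dosages (0/1/2) predicting
# a continuous stand-in methylation outcome, evaluated by RMSE and R2.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(400, 15)).astype(float)  # 15 SNPs, as reported
beta = rng.normal(size=15)
y = X @ beta + rng.normal(0, 1, 400)  # synthetic methylation outcome

# 75/25 train-test split, as in the study design
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
enet = ElasticNetCV(cv=10, random_state=0).fit(X_tr, y_tr)  # 10-fold CV

rmse = mean_squared_error(y_te, enet.predict(X_te)) ** 0.5
print(f"RMSE={rmse:.2f}  R2={r2_score(y_te, enet.predict(X_te)):.2f}")
```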
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Animal10N Training Set consists of 40,000 images of animals from 10 different classes. The images are labeled with the animal's class.
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar100', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar100-3.0.2.png