Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: To develop and validate a machine learning (ML)-based model for predicting stroke-associated pneumonia (SAP) risk in older adult hemorrhagic stroke patients.
Methods: A retrospective cohort of older adult hemorrhagic stroke patients from three tertiary hospitals in Guiyang, Guizhou Province (January 2019–December 2022) formed the modeling cohort, randomly split into training and internal validation sets (7:3 ratio). External validation used retrospective data from January–December 2023. After univariate and multivariate regression analyses, four ML models (logistic regression, XGBoost, naive Bayes, and SVM) were constructed. Receiver operating characteristic (ROC) curves and the area under the curve (AUC) were calculated for the training and internal validation sets. Model performance was compared using DeLong's test or the bootstrap test, while sensitivity, specificity, accuracy, precision, recall, and F1-score evaluated predictive efficacy. Calibration curves assessed model calibration. The optimal model underwent external validation using ROC and calibration curves.
Results: A total of 788 older adult hemorrhagic stroke patients were enrolled, divided into a training set (n = 462), an internal validation set (n = 196), and an external validation set (n = 130). The incidence of SAP was 46.7% (368/788). Advanced age [OR = 1.064, 95% CI (1.024, 1.104)], smoking [OR = 2.488, 95% CI (1.460, 4.24)], low GCS score [OR = 0.675, 95% CI (0.553, 0.825)], low Braden score [OR = 0.741, 95% CI (0.640, 0.858)], and nasogastric tube use [OR = 1.761, 95% CI (1.048, 2.960)] were identified as risk factors for SAP. Among the four machine learning algorithms evaluated [XGBoost, logistic regression (LR), support vector machine (SVM), and naive Bayes], the LR model demonstrated robust and consistent performance in predicting SAP across multiple evaluation metrics.
Furthermore, the model exhibited stable generalizability in the external validation cohort. Based on these findings, the LR framework was selected for external validation, accompanied by a nomogram visualization. The model achieved AUC values of 0.883 (training), 0.855 (internal validation), and 0.882 (external validation). The Hosmer-Lemeshow (H-L) test indicated satisfactory calibration in all three datasets, with P-values of 0.381, 0.142, and 0.066, respectively.
Conclusions: This study constructed and validated a risk prediction model for SAP in older adult patients with hemorrhagic stroke based on multi-center data. Among the four machine learning algorithms (XGBoost, LR, SVM, and naive Bayes), the LR model demonstrated the best and most stable predictive performance. Age, smoking, low GCS score, low Braden score, and nasogastric tube use were identified as predictive factors for SAP in these patients. These indicators are easily obtainable in clinical practice and support rapid bedside assessment. Internal and external validation showed good generalization ability, and a nomogram was drawn to provide an objective, operational risk assessment tool for clinical nursing practice. It helps in the early identification of high-risk patients and guides targeted interventions, thereby reducing the incidence of SAP and improving patient prognosis.
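For illustration, the 7:3 split and logistic regression workflow described above can be sketched with scikit-learn. The data below are synthetic stand-ins (no study data are reproduced), and the feature columns mirror the reported predictors by name only:

```python
# Hedged sketch of the described pipeline: 7:3 split, logistic regression,
# AUC on the internal validation set. All data here are randomly generated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 788  # cohort size reported above
X = np.column_stack([
    rng.normal(75, 7, n),    # age (years)
    rng.integers(0, 2, n),   # smoking (0/1)
    rng.integers(3, 16, n),  # GCS score
    rng.integers(6, 24, n),  # Braden score
    rng.integers(0, 2, n),   # nasogastric tube (0/1)
])
# Synthetic outcome loosely tied to the features so the AUC is non-trivial
logit = 0.05 * X[:, 0] + 0.9 * X[:, 1] - 0.4 * X[:, 2] - 0.3 * X[:, 3] + 0.6 * X[:, 4]
y = (logit + rng.normal(0, 2, n) > np.median(logit)).astype(int)

# 7:3 split into training and internal validation sets, as in the study design
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"internal validation AUC: {auc:.3f}")
```

This is only a structural sketch; the study's reported coefficients and AUC values come from real patient data, not from anything resembling this simulation.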
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF).
Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in machine learning models built with TensorFlow, XGBoost, and SchNetPack to study their ability to predict docking scores. The first data set originally contained 60,411 in-vivo-labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active, or assumed to be active, at 10 μM or less in a direct binding assay. These sets were downloaded on 10 December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in the mol2 files were excluded as well (523 compounds in the in-vivo set and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected.
The provided in-vivo and the in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referencing study.
The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets of each in-vivo 80-5-15 split were created to monitor the convergence of the training process. These subsets were constructed such that each subset contains all compounds from the previous one (starting with the 10-5-15 subset) and is enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).
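The nested-subset construction described above can be sketched as follows. The compound IDs here are stand-ins; the real composition is recorded in in-vivo-splits-data.csv:

```python
# Illustrative sketch of the nested training subsets: each subset extends
# the previous one by one eighth of the full (80%) train set of a split.
# The IDs are synthetic stand-ins, not real ZINC15 compound identifiers.
import numpy as np

rng = np.random.default_rng(1)
train_ids = rng.permutation(1000)  # stand-in for one split's 80% train set

eighth = len(train_ids) // 8
subsets = {}
for k in range(1, 9):  # in_vivo_10 ... in_vivo_80
    subsets[f"in_vivo_{10 * k}"] = train_ids[: eighth * k]

# Each subset is a superset of the previous one, by construction
for k in range(2, 9):
    prev = set(subsets[f"in_vivo_{10 * (k - 1)}"])
    curr = set(subsets[f"in_vivo_{10 * k}"])
    assert prev <= curr
print(len(subsets["in_vivo_80"]))  # the full train set: 1000
```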
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a global resource for machine learning applications in mining area detection and semantic segmentation on satellite imagery. It contains Sentinel-2 satellite images and corresponding mining area masks + bounding boxes for 1,210 sites worldwide. Ground-truth masks are derived from Maus et al. (2022) and Tang et al. (2023), and validated through manual verification to ensure accurate alignment with Sentinel-2 imagery from specific timestamps.
The dataset includes three mask variants:
Each tile corresponds to a 2048x2048 pixel Sentinel-2 image, with metadata on mine type (surface, placer, underground, brine & evaporation) and scale (artisanal, industrial). For convenience, the preferred mask dataset is already split into training (75%), validation (15%), and test (10%) sets.
Furthermore, dataset quality was assessed by manually re-validating the test set tiles and correcting any mismatches between the mining polygons and the visually observed true mining areas in the images, resulting in the following estimated quality metrics:
| Metric (%) | Combined | Maus  | Tang  |
|------------|----------|-------|-------|
| Accuracy   | 99.78    | 99.74 | 99.83 |
| Precision  | 99.22    | 99.20 | 99.24 |
| Recall     | 95.71    | 96.34 | 95.10 |
Note that the dataset does not contain the Sentinel-2 images themselves but contains a reference to specific Sentinel-2 images. Thus, for any ML applications, the images must be persisted first. For example, Sentinel-2 imagery is available from Microsoft's Planetary Computer and filterable via STAC API: https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a. Additionally, the temporal specificity of the data allows integration with other imagery sources from the indicated timestamp, such as Landsat or other high-resolution imagery.
Source code used to generate this dataset and to use it for ML model training is available at https://github.com/SimonJasansky/mine-segmentation. It includes useful Python scripts, e.g. to download Sentinel-2 images via STAC API, or to divide tile images (2048x2048px) into smaller chips (e.g. 512x512px).
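The chip-splitting step mentioned above (2048x2048 tiles into 512x512 chips) can be sketched in plain NumPy. This is only an illustration of the idea; the repository's own scripts may implement it differently:

```python
# Minimal sketch: divide one 2048x2048 tile into non-overlapping 512x512
# chips. The tile here is a zero-filled stand-in for a Sentinel-2 band.
import numpy as np

tile = np.zeros((2048, 2048), dtype=np.uint8)  # stand-in image band
chip_size = 512

chips = [
    tile[r : r + chip_size, c : c + chip_size]
    for r in range(0, tile.shape[0], chip_size)
    for c in range(0, tile.shape[1], chip_size)
]
print(len(chips))  # 4 x 4 = 16 chips per tile
```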
A database schema, a schematic depiction of the dataset generation process, and a map of the global distribution of tiles are provided in the accompanying images.
This dataset is a slightly modified version of the NoReC dataset for document-level sentiment analysis. The data points remain unchanged, with the only adjustment being the compilation into a CSV file for ease of use. This straightforward approach ensures the dataset's simplicity and accessibility while preserving the authenticity of the original content.
Usage The dataset contains a split column which can be used to split the dataset into training, validation and test sets. However, feel free to split the dataset as you see fit.
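A minimal sketch of using the split column with pandas; the column name ("split") is stated above, but the value labels (train/validation/test) and other column names are assumptions about the released CSV:

```python
# Hedged sketch: partition the CSV by its split column. The tiny
# DataFrame below is a stand-in for the released NoReC CSV file.
import pandas as pd

df = pd.DataFrame({
    "text": ["great film", "bad book", "quite okay", "fantastic"],
    "split": ["train", "train", "validation", "test"],
})
train = df[df["split"] == "train"]
val = df[df["split"] == "validation"]
test = df[df["split"] == "test"]
print(len(train), len(val), len(test))  # 2 1 1
```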
CC-BY-NC
Original Data Source: Norwegian Review Corpus
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Cuff-Less Blood Pressure Estimation Dataset [2] comes from the UCI Machine Learning Repository. It is a subset of the MIMIC-II Waveform Dataset and contains 12,000 records of simultaneous PPG and ABP from 942 patients at a sampling rate of 125 Hz. The 12,000 records were uniformly split into four parts of 3,000 records each. However, as subject information is lacking, a hold-one-out strategy was used to generate the training, validation, and test sets once the data had been preprocessed. In the end, the UCI dataset comprised 291,078 segments, around 404 hours of recording, making it by far the largest data set, with a considerably higher ratio of continuous segments per record (32.15).
[2] Kachuee, M., Kiani, M. M., Mohammadzade, H. & Shabany, M. Cuff-less blood pressure estimation data set (2015). UCI repository https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation.
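The segment counts above imply roughly 5-second windows (404 h / 291,078 segments ≈ 5 s). Purely as an illustration under that assumed window length, slicing one record at 125 Hz into non-overlapping segments could look like:

```python
# Hedged sketch: fixed-length segmentation of one waveform record.
# The 5-second window is an assumption for illustration only; the
# record below is a zero-filled stand-in for a real PPG trace.
import numpy as np

fs = 125  # Hz, as stated for the UCI records
window_s = 5  # assumed segment length
record = np.zeros(10 * fs)  # stand-in 10-second record

step = window_s * fs
segments = [record[i : i + step] for i in range(0, len(record) - step + 1, step)]
print(len(segments), len(segments[0]))  # 2 segments of 625 samples each
```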
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Figure 6. Extending sharp-wave ripple detection to non-human primates. c) Significant differences between SWRs recorded in mice and monkeys. d) The best model of each architecture trained on mouse data, and the best filter configuration for mouse data, were applied to detect SWRs in the macaque data. We evaluated all models by computing the F1-score against the ground truth (GT). Note the relatively good results from the non-retrained ML models and filter. e) Results of model re-training using macaque data. Data were split into training and validation sets (50% and 20%, respectively), used to train the ML models, and a test set (30%), used to compute the F1-score (left panel). The filter was not re-trained. f) F1-scores for the maximal performance of each model before and after re-training.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a collection of two datasets: one sourced from CPM data (bham_gcmx-4x_12em_psl-sphum4th-temp4th-vort4th_eqvt_random-season.tar.gz) and one sourced from GCM data (bham_60km-4x_12em_psl-sphum4th-temp4th-vort4th_eqvt_random-season.tar.gz). Each dataset is made up of climate model variables extracted from the Met Office's storage system, combining many variables over many years. It consists of three NetCDF files (train.nc, test.nc and val.nc), a YAML configuration file (ds-config.yml) and a README (similar to this one but tailored to the source of the data). Code used to create the dataset can be found here: https://github.com/henryaddison/mlde-data (specifically the v0.1.0 tag: https://github.com/henryaddison/mlde-data/tree/v0.1.0).
The YML file contains the configuration for the creation of the dataset, including the variables, scenario, ensemble members, spatial domain and resolution, and the scheme for splitting the data across the three subsets.
Each NetCDF file contains the same variables but split into different subsets (train, val and test) based on the time dimension. Otherwise, the NetCDF files have the same dimensions and coordinates for ensemble_member, grid_longitude and grid_latitude.
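The time-based partitioning behind train.nc, val.nc and test.nc can be sketched as below. The pandas timestamps stand in for the NetCDF time coordinate, and the cut-off dates are arbitrary illustrations; the actual split scheme is recorded in ds-config.yml:

```python
# Hedged sketch: same variables, partitioned along the time dimension.
# The years and split boundaries here are made up for illustration.
import pandas as pd

time = pd.to_datetime([f"{y}-01-01" for y in range(1981, 1991)])  # stand-in coordinate

train_mask = time < "1988-01-01"
val_mask = (time >= "1988-01-01") & (time < "1990-01-01")
test_mask = time >= "1990-01-01"
print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 7 2 1
```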
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: A clinical prediction model for postoperative acute kidney injury (AKI) in patients with type A acute aortic dissection (TAAAD) and type B acute aortic dissection (TBAAD) was constructed using machine learning (ML).
Methods: Baseline data were collected from acute aortic dissection (AAD) patients admitted to the First Affiliated Hospital of Xinjiang Medical University between January 1, 2019 and December 31, 2021. (1) We identified baseline serum creatinine (SCR) estimation methods and used them as the basis for the diagnosis of AKI. (2) The total dataset was randomly divided into a training set (70%) and a test set (30%); features were modeled and validated with bootstrap resampling using multiple ML methods in the training set, and the model with the largest area under the curve (AUC) was selected for follow-up studies. (3) The variables of the best ML model were screened using the model visualization tool Shapley Additive Explanations (SHAP) and recursive feature elimination (RFE). (4) Finally, the pre-screened prediction models were evaluated on the test set from three aspects: discrimination, calibration, and clinical benefit.
Results: The final incidence of AKI was 69.4% (120/173) in 173 patients with TAAAD and 28.6% (81/283) in 283 patients with TBAAD. For TAAAD-AKI, the random forest (RF) model showed the best prediction performance in the training set (AUC = 0.760, 95% CI: 0.630–0.881), while for TBAAD-AKI, the Light Gradient Boosting Machine (LightGBM) model worked best (AUC = 0.734, 95% CI: 0.623–0.847). Screening of the characteristic variables revealed that the predictors common to the two final models of postoperative AAD-AKI were baseline SCR, blood urea nitrogen (BUN) and uric acid (UA) at admission, and mechanical ventilation time (MVT).
The specific predictors in the TAAAD-AKI model were white blood cell count (WBC), platelet count (PLT) and D-dimer at admission. The specific predictors in the TBAAD-AKI model were plasma N-terminal pro B-type natriuretic peptide (BNP), serum potassium, activated partial thromboplastin time (APTT) and systolic blood pressure (SBP) at admission, and combined renal arteriography during surgery. In terms of discrimination, the area under the ROC curve of the RF model for TAAAD was 0.81 and that of the LightGBM model for TBAAD was 0.74 in the test set, both with good accuracy. In terms of calibration, the calibration curve of the TAAAD-AKI RF model fit the ideal curve best, with the smallest Brier score (0.16); similarly, the calibration curve of the TBAAD-AKI LightGBM model fit the ideal curve best, with the smallest Brier score (0.15). In terms of clinical benefit, the best ML models for both types of AAD showed good net benefit in decision curve analysis (DCA).
Conclusion: We successfully constructed and validated clinical prediction models for postoperative AKI in TAAAD and TBAAD patients using different ML algorithms. The main predictors of the two types of AAD-AKI differ somewhat, so the strategies for early prevention and control of AKI also differ; more external data are needed for validation.
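The evaluation pattern described above (AUC for discrimination, Brier score for calibration) can be sketched with scikit-learn on synthetic data. Nothing below reproduces the study's features or results; the random forest and the 70/30 split are the only elements taken from the text:

```python
# Hedged sketch: random forest with AUC (discrimination) and Brier score
# (calibration) on a held-out 30% test set. Data are randomly generated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(456, 6))  # stand-in features (e.g. SCR, BUN, UA, MVT, ...)
y = (X[:, 0] + X[:, 1] + rng.normal(0, 1, 456) > 0).astype(int)

# 70/30 train-test split, matching the study design
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

p = rf.predict_proba(X_te)[:, 1]
print(f"AUC={roc_auc_score(y_te, p):.2f}  Brier={brier_score_loss(y_te, p):.2f}")
```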
This dataset consists of 101 food categories, with 101'000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('food101', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/food101-2.0.0.png
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Pharmacogenetics currently supports clinical decision-making on the basis of a limited number of variants in a few genes and may benefit paediatric prescribing, where there is a need for more precise dosing. Integrating genomic information such as methylation into pharmacogenetic models holds the potential to improve their accuracy and, consequently, prescribing decisions. Cytochrome P450 2D6 (CYP2D6) is a highly polymorphic gene conventionally associated with the metabolism of commonly used drugs and endogenous substrates. We thus sought to predict epigenetic loci from single nucleotide polymorphisms (SNPs) related to CYP2D6 in children from the GUSTO cohort.
Methods: Buffy coat DNA methylation was quantified using the Illumina Infinium MethylationEPIC BeadChip. CpG sites associated with CYP2D6 were used as outcome variables in linear regression, Elastic Net and XGBoost models. We compared feature selection of SNPs from GWAS mQTLs, GTEx eQTLs and SNPs within 2 Mb of the CYP2D6 gene, and the impact of adding demographic data. The samples were split into training (75%) and test (25%) sets for validation. In the Elastic Net and XGBoost models, the optimal hyperparameter search was done using 10-fold cross-validation. Root mean square error and R-squared values were obtained to assess each model's performance. When GWAS was performed to determine SNPs associated with CpG sites, a total of 15 SNPs were identified, several of which appeared to influence multiple CpG sites.
Results: Overall, Elastic Net models of genetic features appeared to perform marginally better than heritability estimates and substantially better than linear regression and XGBoost models. The addition of non-genetic features appeared to improve performance for some but not all feature sets and probes. The best feature set and machine learning (ML) approach differed substantially between CpG sites, and a number of top variables were identified for each model.
Discussion: The development of SNP-based prediction models for CYP2D6 CpG methylation in Singaporean children of varying ethnicities in this study has clinical application. With further validation, they may add to the set of tools available to improve precision medicine and pharmacogenetics-based dosing.
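The modeling setup described above (Elastic Net predicting a continuous CpG methylation value from SNP dosages, 75/25 split, 10-fold CV for hyperparameters) can be sketched as follows. All data below are synthetic; no GUSTO data are reproduced, and the 15-SNP feature count is taken from the text:

```python
# Hedged sketch: ElasticNetCV on synthetic SNP dosages (0/1/2) predicting
# a continuous stand-in methylation outcome, evaluated by RMSE and R2.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(400, 15)).astype(float)  # 15 SNPs, as reported
beta = rng.normal(size=15)
y = X @ beta + rng.normal(0, 1, 400)  # synthetic methylation outcome

# 75/25 train-test split, as in the study design
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
enet = ElasticNetCV(cv=10, random_state=0).fit(X_tr, y_tr)  # 10-fold CV

rmse = mean_squared_error(y_te, enet.predict(X_te)) ** 0.5
print(f"RMSE={rmse:.2f}  R2={r2_score(y_te, enet.predict(X_te)):.2f}")
```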
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Animal10N Training Set consists of 40,000 images of animals from 10 different classes. The images are labeled with the animal's class.
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar100', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar100-3.0.2.png