Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of two curated subsets designed for the classification of alteration types using geochemical and proxy variables. The traditional dataset (Trad_Train.csv and Trad_Test.csv) is derived directly from the original complete geochemical dataset (alldata.csv), contains no missing values, and includes the original geochemical features, serving as a baseline for model training and evaluation. In contrast, the simulated dataset (proxies_alldata.csv) was generated through custom MATLAB scripts that transform the original geochemical features into proxy variables based on multiple geostatistical realizations. These proxies, expressed on a Gaussian scale, may include negative values due to normalization. The target variable, Alteration, was originally encoded as integers using the mapping: 1 = AAA, 2 = IAA, 3 = PHY, 4 = PRO, 5 = PTS, and 6 = UAL. The simulated proxy data were split into the simulated train and test files (Simu_Train.csv and Simu_Test.csv) according to an encoded flag (1 = training, 2 = testing). All supporting files, including datasets, intermediate outputs (e.g., PNGs, variograms), proxy outputs, and an executable for the confidence analysis routines, are included in the repository; the source code is hosted in a GitHub repository. Specifically, the FinalMatlabFiles.zip archive contains the raw input file alldata.csv used to generate proxies_alldata.csv, as well as Analysis1.csv and Analysis2.csv for performing confidence analysis. To run the executables in place of the .m scripts in MATLAB, users must install the MATLAB Runtime R2023b for Windows 64-bit, available at: https://ssd.mathworks.com/supportfiles/downloads/R2023b/Release/10/deployment_files/installer/complete/win64/MATLAB_Runtime_R2023b_Update_10_win64.zip.
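As a quick orientation, the traditional subset can be loaded and modeled directly. The sketch below is illustrative only: the feature columns and classifier settings are assumptions, and only the file names and the Alteration encoding come from the description above.

import pandas as pd
from xgboost import XGBClassifier

# Mapping taken from the dataset description (1 = AAA ... 6 = UAL).
ALTERATION_NAMES = {1: "AAA", 2: "IAA", 3: "PHY", 4: "PRO", 5: "PTS", 6: "UAL"}

train = pd.read_csv("Trad_Train.csv")
test = pd.read_csv("Trad_Test.csv")

# Assumed layout: all non-target columns are geochemical features.
X_train, y_train = train.drop(columns=["Alteration"]), train["Alteration"] - 1  # shift 1..6 to 0..5
X_test, y_test = test.drop(columns=["Alteration"]), test["Alteration"] - 1

model = XGBClassifier(objective="multi:softprob", eval_metric="mlogloss")
model.fit(X_train, y_train)
print("Baseline test accuracy:", model.score(X_test, y_test))
print([ALTERATION_NAMES[p + 1] for p in model.predict(X_test)[:5]])  # decode a few predictions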
Analysis1.csv and Analysis2.csv
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Training dataset partitioning results using XGBoost.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes the RNA-seq dataset from 27 GBM samples, as published in this manuscript:
Topographic mapping of the glioblastoma proteome reveals a triple axis model of intra-tumoral heterogeneity
Lam KHB, Leon AJ, Hui W, Lee SCE, Batruch I, Faust K, Koritzinsky M, Richer M, Djuric U, Diamandis P (under review)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate prediction of water inrush volumes is essential for safeguarding tunnel construction operations. This study proposes a method for predicting tunnel water inrush volumes, leveraging the eXtreme Gradient Boosting (XGBoost) model optimized with Bayesian techniques. To maximize the utility of available data, 654 datasets with missing values were imputed and augmented, forming a robust dataset for the training and validation of the Bayesian-optimized XGBoost (BO-XGBoost) model. Furthermore, the SHapley Additive exPlanations (SHAP) method was employed to elucidate the contribution of each input feature to the predictive outcomes. The results indicate that: (1) The constructed BO-XGBoost model exhibited exceptionally high predictive accuracy on the test set, with a root mean square error (RMSE) of 7.5603, mean absolute error (MAE) of 3.2940, mean absolute percentage error (MAPE) of 4.51%, and coefficient of determination (R²) of 0.9755; (2) Compared to the predictive performance of support vector regression (SVR), decision tree (DT), and random forest (RF) models, the BO-XGBoost model demonstrates the highest R² values and the smallest prediction error; (3) The input feature importance ranking yielded by SHAP is groundwater level (h) > water-producing characteristics (W) > tunnel burial depth (H) > rock mass quality index (RQD). The proposed BO-XGBoost model exhibited exceptionally high predictive accuracy on the tunnel water inrush volume prediction dataset, thereby aiding managers in making informed decisions to mitigate water inrush risks and ensuring the safe and efficient advancement of tunnel projects.
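As a hedged illustration of the pipeline described in the abstract, the sketch below pairs XGBoost regression with Optuna's TPE sampler standing in for the Bayesian optimizer, and SHAP for feature attribution; the data, search space, and trial budget are placeholders, not the paper's actual settings.

import optuna
import shap
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the four inputs (h, W, H, RQD) and inrush volumes.
X, y = make_regression(n_samples=654, n_features=4, noise=1.0, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    # Minimize cross-validated RMSE.
    return -cross_val_score(xgb.XGBRegressor(**params), X, y,
                            scoring="neg_root_mean_squared_error", cv=5).mean()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

best = xgb.XGBRegressor(**study.best_params).fit(X, y)
shap_values = shap.TreeExplainer(best).shap_values(X)  # per-feature contributions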
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Semen quality has decreased gradually in recent years, and lifestyle changes are among the primary causes of this issue. Thus far, the specific lifestyle factors affecting semen quality remain to be elucidated.
Materials and methods: In this study, data on the following factors were collected from 5,109 men examined at our reproductive medicine center: 10 lifestyle factors that potentially affect semen quality (smoking status, alcohol consumption, staying up late, sleeplessness, consumption of pungent food, intensity of sports activity, sedentary lifestyle, working in hot conditions, sauna use in the last 3 months, and exposure to radioactivity); general factors including age, abstinence period, and season of semen examination; and comprehensive semen parameters [semen volume, sperm concentration, progressive and total sperm motility, sperm morphology, and DNA fragmentation index (DFI)]. Then, machine learning with the XGBoost algorithm was applied to establish a primary prediction model from the collected data. Furthermore, the accuracy of the model was verified via multiple logistic regression following k-fold cross-validation analyses.
Results: The results indicated that for semen volume, sperm concentration, progressive and total sperm motility, and DFI, the area under the curve (AUC) values ranged from 0.648 to 0.697, while the AUC for sperm morphology was only 0.506. Among the 13 factors, smoking status was the major factor affecting semen volume, sperm concentration, and progressive and total sperm motility. Age was the most important factor affecting DFI. Logistic regression combined with cross-validation analysis revealed similar results. Furthermore, it showed that heavy smoking (>20 cigarettes/day) had an overall negative effect on semen volume, sperm concentration, and progressive and total sperm motility (OR = 4.69, 6.97, 11.16, and 10.35, respectively), while age of >35 years was associated with increased DFI (OR = 5.47).
Conclusion: The preliminary lifestyle-based model developed for semen quality prediction by using the XGBoost algorithm showed potential for clinical application and further optimization with larger training datasets.
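The modeling step described above reduces, in outline, to one binary XGBoost classifier per semen parameter, evaluated by cross-validated AUC. A minimal sketch follows; the synthetic data stand in for the 13 collected factors, which are not reproduced here.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in: 5,109 men, 13 lifestyle/general factors, one binary outcome
# (e.g., low vs. normal sperm concentration).
X, y = make_classification(n_samples=5109, n_features=13, random_state=0)

auc = cross_val_score(XGBClassifier(eval_metric="logloss"), X, y, scoring="roc_auc", cv=5)
print(f"AUC: {auc.mean():.3f} +/- {auc.std():.3f}")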
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of a machine learning project focused on predicting rainfall, a critical task for sectors like agriculture, water resource management, and disaster prevention. The project employs machine learning algorithms to forecast rainfall occurrences based on historical weather data, including features like temperature, humidity, and pressure.
The primary goal of the dataset is to train multiple machine learning models to predict rainfall and compare their performances. The insights gained will help identify the most accurate models for real-world predictions of rainfall events.
The dataset is derived from various historical weather observations, including temperature, humidity, wind speed, and pressure, collected by weather stations across Australia. These observations are used as inputs for training machine learning models. The dataset is publicly available on platforms like Kaggle and is often used in competitions and research to advance predictive analytics in meteorology.
The dataset consists of weather data from multiple Australian weather stations, spanning various time periods. Key features include:
Temperature
Humidity
Wind Speed
Pressure
Rainfall (target variable)
These features are tracked for each weather station over different times, with the goal of predicting rainfall.
Python: The primary programming language for data analysis and machine learning.
scikit-learn: For implementing machine learning models.
XGBoost, LightGBM, and CatBoost: Popular libraries for building more advanced ensemble models.
Matplotlib/Seaborn: For data visualization.
These libraries and tools help in data manipulation, modeling, evaluation, and visualization of results.
DBRepo Authorization: Required to access datasets via the DBRepo API for dataset retrieval.
Model Comparison Charts: The project includes output charts comparing the performance of seven popular machine learning models.
Trained Models (.pkl files): Pre-trained models are saved as .pkl files for reuse without retraining.
Documentation and Code: A Jupyter notebook guides through the process of data analysis, model training, and evaluation.
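A minimal sketch of the workflow implied by the tool list above: train several models, compare accuracy, and persist each as a .pkl file. The file and column names are assumptions for illustration.

import pickle
import pandas as pd
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("weather.csv").dropna()  # hypothetical file; columns per the feature list
X = df[["Temperature", "Humidity", "WindSpeed", "Pressure"]]
y = df["Rainfall"]  # binary target: rain / no rain

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(),
    "xgboost": XGBClassifier(eval_metric="logloss"),
    "lightgbm": LGBMClassifier(),
    "catboost": CatBoostClassifier(verbose=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
    with open(f"{name}.pkl", "wb") as f:  # saved for reuse without retraining
        pickle.dump(model, f)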
Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF). Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in the machine learning models using TensorFlow, XGBoost, and SchNetPack to study their docking-score prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo and 1,666 in the in-vitro-only...
We investigate the performance of machine-learning techniques in classifying active galactic nuclei (AGNs), including X-ray-selected AGNs (XAGNs), infrared-selected AGNs (IRAGNs), and radio-selected AGNs (RAGNs). Using the known physical parameters in the Cosmic Evolution Survey (COSMOS) field, we are able to create quality training samples in the region of the Hyper Suprime-Cam (HSC) survey. We compare several Python packages (e.g., scikit-learn, Keras, and XGBoost) and use XGBoost to identify AGNs and show the performance (e.g., accuracy, precision, recall, F1 score, and AUROC). Our results indicate that the performance is high for bright XAGN and IRAGN host galaxies. The combination of the HSC (optical) information with the Wide-field Infrared Survey Explorer band 1 and band 2 (near-infrared) information performs well to identify AGN hosts. For both type 1 (broad-line) XAGNs and type 1 (unobscured) IRAGNs, the performance is very good by using optical-to-infrared information. These results can apply to the five-band data from the wide regions of the HSC survey and future all-sky surveys. Cone search capability for table J/ApJ/920/68/table7 (AGN candidates in HSC-Wide region for 112609 objects)
https://spdx.org/licenses/CC0-1.0.html
Acoustic signals are vital in animal communication, and quantifying these signals is fundamental for understanding animal behaviour and ecology. Vocalisations can be classified into acoustically and functionally or contextually distinct categories, but establishing these categories can be challenging. Newly developed methods, such as machine learning, can provide solutions for classification tasks. The plains zebra is known for its loud and specific vocalisations, yet limited knowledge exists on the structure and information content of its vocalisations. In this study, we employed both feature-based and spectrogram-based algorithms, incorporating supervised and unsupervised machine learning methods to enhance robustness in categorising zebra vocalisation types. Additionally, we implemented a permuted discriminant function analysis (pDFA) to examine the individual identity information contained in the identified vocalisation types. The findings revealed at least four distinct vocalisation types: the "snort", the "soft snort", the "squeal", and the "quagga quagga", with individual differences observed mostly in snorts, and to a lesser extent in squeals. Analyses based on acoustic features outperformed those based on spectrograms, but each excelled in characterising different vocalisation types. We thus recommend the combined use of these two approaches. This study offers valuable insights into plains zebra vocalisation, with implications for future comprehensive explorations in animal communication.

Methods

Data collection and sampling

We collected data in three locations in Denmark and South Africa: 1) 10 months between December 2020 and July 2021 and between September and December 2021, at Pilanesberg National Park (hereafter "PNP"), South Africa, covering both the dry season (i.e. from May to September) and the wet season (i.e. from October to April) (1); 2) 16 days between May and June 2019, and 33 days between February and May 2022, at Knuthenborg Safari Park (hereafter "KSP"), Denmark, covering both periods before the park's opening for tourists (i.e. from November to March) and after (i.e. from April to October); 3) 4 days in August 2019 at Givskud Zoo (hereafter "GKZ"), Denmark. For all places and periods, three types of data were collected as follows: 1) pictures were taken of each individual from both sides using a camera (Nikon COOLPIX P950); 2) contexts of vocal production were recorded either through notes (in the first period of KSP and in GKZ) or videos (in the second period of KSP and in PNP) filmed by a video camera recorder (Sony HDRPJ410 HD); 3) audio recordings were collected using a directional microphone (Sennheiser MKH-70 P48, with a frequency response of 50 - 20000 Hz (+/- 2.5 dB)) linked to an audio recorder (Marantz PMD661 MKIII). Six zebras housed in GKZ were recorded while being separated from one another into three enclosures (the stable, the small enclosure and the savannah) manually by the zookeeper for management purposes, which triggered vocalisations. These vocalisations, along with other types of data, were recorded at distances of 5 - 30 m. In KSP, 15 - 18 zebras (the population changed due to newborns, deaths, or removal of adult males) were living with other herbivores in a 0.14 km² savannah. There, we approached the zebras by driving down the road until approximately 7 - 40 m, at which point spontaneous vocalisations and other information were collected.
This distance allowed us to collect good quality recordings without eliciting any obvious reactions from the zebras to our presence. Finally, PNP is a 580 km² national park, with approximately 800 - 2000 zebras (2). In this park, we drove on the road and parked at distances of 10 - 80 m when encountering zebras, where all data, including spontaneous vocalisations, were recorded.

Data processing

Individual zebras were manually identified based on the pictures collected from KSP and GKZ (15-18 and 6 zebras, respectively). In PNP, the animals present in the pictures were individually identified using WildMe (https://zebra.wildme.org/), a web-based machine learning platform facilitating individual recognition. All zebra pictures were uploaded to the platform for a full comparison through the algorithm. The resulting matching candidates were then determined by manually reviewing the output. Audio files (sampling rate: 44100 Hz) were saved at 16-bit amplitude resolution in WAV format. We annotated zebra vocalisations, along with the context and the individuals emitting the vocalisations, using Audacity software (version 3.3.3) (3). Vocalisations were first subjectively labelled as five vocalisation types based on both audio and spectrogram examinations (i.e. visual inspection) (Table 1 and Figure 1). Among these types, the "squeal-snort" was excluded from further analysis, as the focus of this study was on individual vocalisation types instead of combinations.

Acoustic analysis

We extracted vocalisations of good quality, defined as vocalisations with clear spectrograms, low background noise, and no overlap with other sounds, and saved them as distinct audio files. For the individual distinctiveness analysis, we excluded individuals with fewer than 5 vocalisations of each type, to avoid strong imbalance, resulting in 359 snorts from 28 individuals and 138 squeals from 14 individuals (Table S3 and S4) (4, 5). The individuality content of quagga quagga and soft snorts could not be explored, due to insufficient individual data. For the vocal repertoire analysis, we excluded vocalisations longer than 1.25 s to improve spectrogram-based analysis, following Thomas et al. (6). In total, we gathered 678 vocalisations for the spectrogram-based vocal repertoire analysis, including 117 quagga quagga, 204 snorts, 161 squeals and 196 soft snorts (Table S2). Among these vocalisations, six squeals were excluded from the acoustic feature-based vocal repertoire analysis, due to missing data for one of the features (amplitude modulation extent). All calls were first high-pass filtered above 30 Hz for snorts and soft snorts, above 500 Hz for squeals and above 600 Hz for quagga quagga (i.e. above the average minimum fundamental frequency of these vocalisations; Table S1). We then extracted 12 acoustic features from the vocalisations for the individual distinctiveness analysis (Table 2), using a custom script (7-10) in Praat software (11). Eight of these features were also extracted for the vocal repertoire analysis (i.e. all features except those related to the fundamental frequency, which were not available for soft snorts, which are not tonal). Additionally, to explore the vocal repertoire, mel-spectrograms were generated from the audio files using STFT, following Thomas et al. (6). Spectrograms were padded with zeros according to the length of the longest audio file to ensure uniform length for all audio files, and time-shift adjustments were implemented to align the starting points of vocalisations (6).

Statistical analyses

a. Vocal repertoire

We applied both supervised and unsupervised machine learning to both acoustic features and spectrograms, using Python (version 3.9.7) (12).

Supervised method. To define the vocal repertoire via an acoustic feature-based approach, we deployed feature importance analysis by SHapley Additive exPlanation (SHAP) (13), using the shap library (version 0.40.0) (14). Six features with SHAP value > 1 were selected (Figure S1). We split the selected features with vocalisation type labels into a training dataset (70%) and a testing dataset (30%) using the Scikit-learn library (function: train_test_split, version 0.24.2) (15). Subsequently, we employed a supervised approach, the eXtreme Gradient Boosting (XGBoost) classifier in the xgboost library (version 1.6.0) (16), to train the model. Three hyperparameters were tuned on the training dataset to reach maximum accuracy using the optuna library (direction = minimize, n_trials = 200, version 2.10.0) (17), incorporating cross-validation (five folds), which resulted in the best model (Table S5). To define the vocal repertoire via a spectrogram-based approach, we split the dataset into a training set (49%), a validation set (21%), and a test dataset (30%), using the Scikit-learn library (function: train_test_split, version 0.24.2) (15). We implemented a Convolutional Neural Network (CNN) architecture using the tensorflow library (version 2.8.0) (18). The architecture was constructed (Table S6) and seven hyperparameters were tuned to reach maximum accuracy on the training and validation datasets using the optuna library (direction = minimize, n_trials = 50, version 2.10.0) (17), which resulted in the best model (Table S6). We evaluated model performance for both feature-based and spectrogram-based classification models through predictions on each test dataset, including the test accuracy across all call types (number of correct predictions / total number of predictions), and three metrics for each call type: precision (true positives / (true positives + false positives)), recall (true positives / (true positives + false negatives)), and the harmonic mean of precision and recall, the F1-score (2 × (precision × recall) / (precision + recall)) (19). We also plotted the confusion matrix between true classes and predicted classes.

Unsupervised method. For both acoustic feature-based and spectrogram-based analyses, we applied Uniform Manifold Approximation and Projection (UMAP) in the umap library (function: umap.UMAP, n_neighbors = 200 and local_connectivity = 150 for the acoustic feature-based analysis, and metric = calc_timeshift_pad and min_dist = 0 for the spectrogram-based analysis, version 0.1.1) (20), to reduce the variables into a 2-dimensional latent space. We also implemented the k-means clustering algorithm for both analyses from the Scikit-learn library (function: kmeans.fit, version 0.24.2) (15), to identify distinct clusters using the elbow method (21). The
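As a hedged reconstruction of the supervised, feature-based step above (train/test split, XGBoost, Optuna tuning with five-fold cross-validation), the sketch below uses synthetic data in place of the six SHAP-selected Praat features; the tuned hyperparameters and their search ranges are assumptions.

import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in: 672 calls, 6 acoustic features, 4 vocalisation types.
features, labels = make_classification(n_samples=672, n_features=6, n_informative=5,
                                       n_redundant=1, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(features, labels,
                                                    test_size=0.3, random_state=0)

def objective(trial):
    params = {  # three tuned hyperparameters, as in the study (ranges assumed)
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.5, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
    }
    acc = cross_val_score(XGBClassifier(**params), X_train, y_train, cv=5).mean()
    return 1.0 - acc  # optuna minimizes (direction = "minimize")

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=200)

best_model = XGBClassifier(**study.best_params).fit(X_train, y_train)
print("Test accuracy:", best_model.score(X_test, y_test))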
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🇬🇧 English:
This synthetic dataset helps build machine learning models to predict whether a patient is at risk of heart disease. It includes patient attributes such as age, cholesterol, blood pressure, sex, and diabetes history.
🇹🇷 Türkçe (translated):
This synthetic dataset is designed to develop machine learning models that predict whether patients are at risk of heart disease. It includes features such as age, cholesterol, blood pressure, sex, and diabetes information.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains predictions of occurrence probability for ticks in Great Britain (2014 to 2021) at 1 km spatial resolution, plus all covariate layers used for modeling. Over seven million electronic health records (EHRs), among which 11,741 EHRs reported tick attachment, were used to evaluate climate, environmental and animal host factors affecting the risk of tick attachment in cats and dogs in Great Britain (GB). The tick presence/absence EHRs for dogs and cats were further overlaid with spatiotemporal time-series of climatic, vegetation, human influence, hydrological and terrain variables (slope, wetness index) to produce a spatiotemporal regression matrix; an Ensemble Machine Learning framework was used to fine-tune hyperparameters for Random Forest (classif.ranger), Gradient Boosting (classif.xgboost) and GLM-net (classif.glmnet) algorithms, which were then used to produce a final ensemble meta-learner that predicts the probability of occurrence of ticks across GB at monthly intervals.

gb1km_covariates.zip contains ALL covariate layers as GeoTIFFs (time-series) used for modeling tick dynamics; data_1km_2014_M01.rds contains all covariates for January 2014 prepared as a SpatialGridDataFrame (R data object). File-name codes indicate, e.g.: "monthly.tick.prob_savsnet.mar_p_1km_s_2014_2021" = monthly occurrence probability for March based on the training data from 2014 to 2021; "monthly.tick.prob_savsnet.oct_md_1km_s_20211001_20211031" = monthly prediction (model) error for October 2021, derived as the standard deviation from multiple base learners.

The dataset is described in detail in the following publication: Arsevska, E., Hengl, T., Singelton, D. et al. (2023?) Risk factors for tick attachment in companion animals in Great Britain: a spatiotemporal analysis covering 2014–2021. Submitted to Parasites & Vectors (in review).

The model summary shows:

Call:
stats::glm(formula = f, family = "binomial", data = getTaskData(.task, .subset), weights = .weights, model = FALSE)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -1.4749  -0.0557  -0.0471  -0.0430   3.7611

Coefficients:
                 Estimate Std. Error  z value Pr(>|z|)
(Intercept)      -7.64495    0.02095 -364.957  < 2e-16 ***
classif.ranger    4.95061    0.63615    7.782 7.13e-15 ***
classif.xgboost 189.75543    5.53109   34.307  < 2e-16 ***
classif.glmnet  140.24208    5.05375   27.750  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 170604 on 7303013 degrees of freedom
Residual deviance: 162571 on 7303010 degrees of freedom
AIC: 162579

Number of Fisher Scoring iterations: 9

Acknowledgements: We are grateful to data providers in veterinary practice (VetSolutions, Teleos, CVS, and other practitioners). We are grateful to the INRAE MIGALE bioinformatics facility (MIGALE, INRAE, 2020. Migale Bioinformatics Facility, doi: 10.15454/1.5572390655343293E12) for providing computing resources. We are also grateful for the help and support provided by SAVSNET team members Bethaney Brant, Susan Bolan and Steven Smyth. This study was funded mainly by a grant from the Biotechnology and Biological Sciences Research Council, BB/NO19547/1, and the British Small Animal Veterinary Association (BSAVA). The research was partly funded by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Emerging and Zoonotic Infections at the University of Liverpool, in partnership with Public Health England (PHE) and Liverpool School of Tropical Medicine (LSTM).
This work has been partially funded by the “Monitoring outbreak events for disease surveillance in a data science context" (MOOD) project from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 874850 (https://mood-h2020.eu/). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, the Department of Health or Public Health England.
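The R/mlr stacking summarized above (a binomial GLM meta-learner over ranger, xgboost and glmnet base learners) has a close Python analogue in scikit-learn's StackingClassifier; the sketch below is that analogue on synthetic imbalanced data, not the study's actual code.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

estimators = [
    ("ranger", RandomForestClassifier()),  # Random Forest, as classif.ranger
    ("xgboost", XGBClassifier(eval_metric="logloss")),  # classif.xgboost
    ("glmnet", LogisticRegression(penalty="elasticnet", solver="saga",
                                  l1_ratio=0.5, max_iter=5000)),  # classif.glmnet
]
# Binomial GLM meta-learner fitted on base-learner probabilities.
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(),
                           stack_method="predict_proba")

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)  # rare-event stand-in
stack.fit(X, y)
print(stack.final_estimator_.coef_)  # analogous to the coefficient table above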
This dataset contains several data, results, and processing materials from the application of GEOBIA-based, Spatially Partitioned Segmentation Parameter Optimization (SPUSPO) in the city of Ouagadougou. In detail, it contains:
A Land Use - Land Cover map of Ouagadougou derived through SPUSPO. The classifier used was Extreme Gradient Boosting (XGBoost). Labels: 0 = Building, 1 = Swimming Pool, 2 = Artificial Ground Surface, 3 = Bare Ground, 4 = Tree, 5 = Low Vegetation, 6 = Inland Water, 7 = Shadow.
The training and test data used in the study (SPUSPO and the benchmark approach), given in CSV format.
The Jupyter notebook code, which uses Python and GRASS GIS to automate and efficiently perform SPUSPO on a large dataset, i.e. Python code calling GRASS GIS functions to automate the procedure.
The segmentation layers from SPUSPO and the benchmark approach (in raster format due to data limitations): segmentation rasters for each approach.
The R code for the optimization of XGBoost, feature selection with VSURF, and classification of the whole dataset.
Segmentation evaluation metrics: a CSV file with the data used to compute the Area Fit Index for each approach.
Morphological zones of Ouagadougou, as created by Grippa et al. 2017, in shp format.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset, from the paper 'Forecasting high-impact research topics via machine learning on evolving knowledge graphs' by Xuemei Gu and Mario Krenn, includes benchmark data for evaluating fully connected NNs, Transformers, Random Forest, and XGBoost on prediction tasks with 2-4 year training and 1-5 year evaluation intervals across two impact ranges (IR). It also provides examples covering about 10M evaluation samples (2019-2022), with raw outputs from a neural network trained on 2016-2019 data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains: 1. PlanetScope satellite imagery of the University of Brawijaya with 3 m spatial resolution. 2. Training and testing data in CSV format. 3. An R script for four different algorithms (XGBoost, Random Forest, Support Vector Machine, and Neural Networks). The manuscript using this dataset has been submitted to F1000Research (https://f1000research.com/).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset reconstructs the annual mass balance of glaciers larger than 0.1 km² in the Tien Shan and Pamir regions from 1950 to 2022. The dataset is derived using a nonlinear relationship between glacier mass balance and meteorological and topographical variables. The reconstruction method employs the XGBoost algorithm. Initially, XGBoost is trained on the complete training dataset, followed by incremental training for each sub-region to tailor models to specific regional characteristics. The final training results yield an average coefficient of determination (R²) of 0.87.
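A minimal sketch of the two-stage scheme described above (a global model refined per sub-region) using xgboost's continued-training interface; the predictors, region labels, and boosting budgets are synthetic placeholders.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
# Stand-in for meteorological/topographical inputs and annual mass balance.
X, y = rng.normal(size=(5000, 8)), rng.normal(size=5000)
region = rng.integers(0, 3, size=5000)  # hypothetical sub-region labels

params = {"objective": "reg:squarederror", "max_depth": 6, "eta": 0.1}

# Step 1: train on the complete training dataset.
global_model = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=200)

# Step 2: incremental training per sub-region, starting from the global model.
regional_models = {}
for r in range(3):
    mask = region == r
    d_sub = xgb.DMatrix(X[mask], label=y[mask])
    regional_models[r] = xgb.train(params, d_sub, num_boost_round=50, xgb_model=global_model)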
All code used in this dataset is publicly available and organized into the following five sections:
Data Processing
Model Training
Result Analysis
Result Evaluation
SHAP Analysis
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Models and Predictions
This dataset contains the trained XGBoost and EA-LSTM models and the models' predictions for the paper The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction.
For each combination of model (XGBoost, EA-LSTM), training years (3, 6, 9), number of basins (13, 26, 53, 265, 531), and seed (111-888), there are five folders. Each corresponds to a random basin sample (for 531 basins there's only one folder, since it's all basins).
In each folder, there are two files:
In addition to each folder, there is a SLURM submission script that was used to create and evaluate the model in the folder.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cancer of unknown primary site (CUP) is a heterogeneous group of cancers whose tissue of origin remains unknown after detailed investigation by conventional clinical methods. CUP accounts for roughly 3%–5% of all human malignancies. CUP patients are usually treated with broad-spectrum chemotherapy, which often leads to a poor prognosis. Recent studies suggest that treatment targeting the primary lesion of CUP significantly improves the prognosis of the patient. Therefore, it is urgent to develop an efficient method to accurately detect the tissue of origin of CUP in clinical cancer research. In this work, we developed a novel framework that uses Extreme Gradient Boosting (XGBoost) to trace the primary site of CUP based on microarray-based gene expression data. First, we downloaded the microarray-based gene expression profiles of 59,385 genes for 5,708 samples from The Cancer Genome Atlas (TCGA) and 6,364 genes for 3,101 samples from the Gene Expression Omnibus (GEO). Both datasets were divided into training and independent testing data at a ratio of 4:1. Then, from the training data we selected 200 and 290 genes from the TCGA and GEO datasets, respectively, to train XGBoost models for the identification of the primary site of CUP. The overall 5-fold cross-validation accuracies of our methods were 96.9% and 95.3% on the TCGA and GEO training datasets, respectively. Meanwhile, the macro-precision on the independent datasets reached 96.75% and 98.8% on TCGA and GEO, respectively. Experimental results demonstrated that the XGBoost framework not only can reduce the cost of clinical cancer traceability but also has high efficiency, which might be useful in clinical usage.
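In outline, the evaluation protocol above is a multi-class XGBoost classifier scored with 5-fold cross-validation on the training split and macro-precision on the held-out split. A sketch on synthetic data follows; the gene counts and class numbers are placeholders.

from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the 200 selected genes over multiple primary sites.
X, y = make_classification(n_samples=3000, n_features=200, n_informative=50,
                           n_classes=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=0)  # 4:1 split

clf = XGBClassifier(objective="multi:softprob", eval_metric="mlogloss")
print("5-fold CV accuracy:", cross_val_score(clf, X_tr, y_tr, cv=5).mean())
clf.fit(X_tr, y_tr)
print("Macro-precision:", precision_score(y_te, clf.predict(X_te), average="macro"))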
Gross Primary Production (GPP) represents the total amount of carbon fixed by plants through photosynthesis in an ecosystem over a specific period. The GPP data products are derived using a data-driven approach that integrates Earth observation data with in-situ carbon flux measurements. Specifically, GPP estimations combine Sentinel-2 multispectral imagery with carbon flux data from eddy covariance towers, employing the XGBoost machine learning algorithm for prediction. The resulting GPP maps are generated at a 10-meter spatial resolution and a temporal frequency of up to 5 days, covering the period from March 2017 to December 2023. The temporal resolution is contingent upon 50% cloud-free conditions in the area of interest, with lower frequencies occurring during periods of high cloud coverage. The spatial extent of the GPP maps corresponds to the boundaries of long-term observation sites as recorded in the DEIMS-SDR registry (e.g., https://deims.org/6f716444-c0bd-4a04-b72b-add3e302eef1). For sites where the boundary area is smaller than 1 km², or if only point coordinates are available in DEIMS-SDR, the maps are constrained to a 1 km x 1 km bounding box.

The methodology integrates multiple data sources via machine learning techniques to estimate GPP across different ecosystems. The process begins with data pre-processing, including the selection of sites based on criteria such as available vegetation information and at least three full years of eddy covariance flux data. GPP and environmental data (e.g. air temperature, vapor pressure deficit) are extracted from the ICOS database across different ecosystem types. Then, different remote sensing (RS) indices (e.g. NDVI, EVI) are estimated in GEE using Sentinel-2 MSI data, as the mean value of the pixels found inside the climatological footprint 70 (the area from which 70% of the GPP measurements originate). Both the ICOS data and the RS indices are used as predictors for the model. The data is split, with 70% used for training and 30% for testing. In the model setup, an XGBoost model is trained using the selected environmental and RS-based indices, and the model parameters are fine-tuned to improve accuracy. The remaining 30% of testing data is used to evaluate the model's performance by comparing its predictions against in-situ GPP data. Error metrics like Mean Absolute Error (MAE), Root Mean Absolute Error (RMAE), and R² are provided. The map computation phase applies the trained model to ecosystem boundaries from the DEIMS website to generate 5-day GPP maps.

Acknowledgement: This work on the AGAME Gross Primary Production data product is funded by the European Space Agency (ESA, contract no. 4000143740/24/I-AG) in the frame of the GEOSS Platform Plus project (Horizon Europe, GA No. 101039118). The work done is based on the requirements from eLTER, contributing in addition to the eLTER Site Information Cluster. In-situ data for model calibration and validation have been derived from the ICOS Carbon Portal.
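The model setup and evaluation described above (70/30 split, XGBoost regression, error metrics on the held-out 30%) can be sketched as follows; the predictors are random placeholders for the RS indices and ICOS environmental drivers.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(4000, 6))  # stand-in for NDVI, EVI, air temperature, VPD, ...
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=4000)  # pseudo-GPP target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)  # 70/30
model = XGBRegressor().fit(X_tr, y_tr)
pred = model.predict(X_te)

print("MAE :", mean_absolute_error(y_te, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_te, pred)))
print("R2  :", r2_score(y_te, pred))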
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model performance results based on random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking, with EMBARC data as the training set and APAT data as the testing set, after multiple imputation performed 10 times.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: The Department of Rehabilitation Medicine is key to improving patients' quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models.
Methods: Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The imbalanced-learn library's SMOTE technique was used to address sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors.
Results: XGBoost achieved the highest overall accuracy of 80.21%, with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients' conditions and recommend suitable medical institutions based on rehabilitation grading.
Conclusion: This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
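As a hedged sketch of the class-imbalance handling described in the methods (SMOTE applied to the training split before fitting XGBoost), using synthetic data in place of the 80-variable outpatient records:

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic imbalanced stand-in (binary here; the study's target has several categories).
X, y = make_classification(n_samples=4244, n_features=20, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)  # 80/20 split

# Oversample only the training data so the test set keeps its natural imbalance.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = XGBClassifier(eval_metric="logloss").fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te)))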