82 datasets found

Data from: Multi-Sensor Integration and Machine Learning for High-Resolution...
agdatacommons.nal.usda.gov
xlsx
Updated May 16, 2025
Cite
Iddy Muzzo; Kelvyn Bladen; Andres Perea; Shelemia Nyamuryekung'e; Juan J. Villalba (2025). Multi-Sensor Integration and Machine Learning for High-Resolution Classification of Herbivore Foraging Behavior [Dataset]. http://doi.org/10.15482/USDA.ADC/28507400.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.15482/USDA.ADC/28507400.v1
Dataset updated
May 16, 2025
Dataset provided by
Ag Data Commons
Authors
Iddy Muzzo; Kelvyn Bladen; Andres Perea; Shelemia Nyamuryekung'e; Juan J. Villalba
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The study used Random Test-Split (RTS) and Cross-Validation (CV) machine learning methods to test different models for classifying cattle foraging behavior states, foraging activities, posture, and activity by posture, using GPS-coupled accelerometer data, with 12 hours per day of continuous observation recording as supporting ground truth. RTS in XGBoost performed best for general activity state classification, while CV in Random Forest excelled in the more detailed foraging activity and activity-by-posture classifications. Key movement indicators such as speed, Actindex, and sensor values (x, y, and z) were vital in predicting behaviors, suggesting specific sensors for tracking behaviors of interest to ranchers. The results highlight the benefits of continuous monitoring and advanced data analysis for real-time livestock tracking, leading to better grazing management, improved animal welfare, and more sustainable land use.
Data_Sheet_2_A Novel XGBoost Method to Identify Cancer Tissue-of-Origin...
frontiersin.figshare.com
txt
Updated Jun 5, 2023
Cite
Yulin Zhang; Tong Feng; Shudong Wang; Ruyi Dong; Jialiang Yang; Jionglong Su; Bo Wang (2023). Data_Sheet_2_A Novel XGBoost Method to Identify Cancer Tissue-of-Origin Based on Copy Number Variations.CSV [Dataset]. http://doi.org/10.3389/fgene.2020.585029.s002
Available download formats: txt
Unique identifier
https://doi.org/10.3389/fgene.2020.585029.s002
Dataset updated
Jun 5, 2023
Dataset provided by
Frontiers
Authors
Yulin Zhang; Tong Feng; Shudong Wang; Ruyi Dong; Jialiang Yang; Jionglong Su; Bo Wang
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The discovery of cancer of unknown primary (CUP) is of great significance in designing more effective treatments and improving the diagnostic efficiency in cancer patients. In the study, we develop an appropriate machine learning model for tracing the tissue of origin of CUP with high accuracy after feature engineering and model evaluation. Based on copy number variation data consisting of 4,566 training cases and 1,262 independent validation cases, an XGBoost classifier is applied to 10 types of cancer. Extremely randomized tree (Extra tree) is used for dimension reduction so that fewer variables replace the original high-dimensional variables. Features with top 300 weights are selected and principal component analysis is applied to eliminate noise. We find that the XGBoost classifier achieves the highest overall accuracy of 0.8913 in the 10-fold cross-validation for training samples and 0.7421 on independent validation datasets for predicting tumor tissue of origin. Furthermore, by contrasting various performance indices, such as precision and recall rate, the experimental results show that the XGBoost classifier significantly improves the classification performance of various tumors with less prediction error, as compared to other classifiers, such as K-nearest neighbors (KNN), Bayes, support vector machine (SVM), and Adaboost. Our method can infer tissue of origin for the 10 cancer types with acceptable accuracy in both cross-validation and independent validation data. It may be used as an auxiliary diagnostic method to determine the actual clinicopathological status of specific cancer.
Dataset for Classification of Suspicious Financial Transactions
zenodo.org
Updated May 23, 2025
Cite
Edho Dwi Jayanto (2025). Dataset for Classification of Suspicious Financial Transactions [Dataset]. http://doi.org/10.5281/zenodo.15493392
Unique identifier
https://doi.org/10.5281/zenodo.15493392
Dataset updated
May 23, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Edho Dwi Jayanto
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Abstract— This study investigates the application of machine learning models for detecting suspicious financial transactions. Utilizing a dataset of 12,571 transactions from PT Bank ABC, the research encompasses various stages such as data preprocessing, feature selection, and addressing class imbalance. The models evaluated include Random Forest, XGBoost, and SVM, which were assessed through cross-validation with StratifiedKFold and optimized using RandomizedSearchCV.
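The description above mentions stratified k-fold cross-validation for an imbalanced classification problem. As a rough illustration of the idea only (this is a pure-Python sketch, not scikit-learn's `StratifiedKFold` implementation and not the author's code), the function below deals the indices of each class round-robin into folds so that every fold keeps roughly the overall class ratio. The function name `stratified_k_fold` and the label counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds roughly keep the class ratio."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Shuffle within each class, then deal indices round-robin into k folds.
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)

    # Each fold serves once as the test set; the rest form the training set.
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


# Hypothetical imbalanced labels: 20 "suspicious" (1) and 80 "normal" (0).
labels = [1] * 20 + [0] * 80
splits = list(stratified_k_fold(labels, k=5))
```

With these made-up labels, every test fold holds 4 positive and 16 negative samples, so a minority class is never missing from an evaluation fold.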
Data from: Gradient Boosting Machine Learning to Improve Satellite-Derived...
zenodo.org
zip
Updated Nov 24, 2021
Cite
Allan C. Just; Yang Liu; Meytar Sorek-Hamer; Johnathan Rush; Michael Dorman; Robert Chatfield; Yujie Wang; Alexei Lyapustin; Itai Kloog (2021). Gradient Boosting Machine Learning to Improve Satellite-Derived Column Water Vapor Measurement Error [Dataset]. http://doi.org/10.5281/zenodo.3542300
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.3542300
Dataset updated
Nov 24, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Allan C. Just; Yang Liu; Meytar Sorek-Hamer; Johnathan Rush; Michael Dorman; Robert Chatfield; Yujie Wang; Alexei Lyapustin; Itai Kloog
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The atmospheric products of the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm include column water vapor (CWV) at 1 km resolution, derived from daily overpasses of NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) instruments aboard the Aqua and Terra satellites. We have recently shown that machine learning using extreme gradient boosting (XGBoost) can improve the estimation of MAIAC aerosol optical depth (AOD). Although MAIAC CWV is generally well validated (Pearson’s R >0.97 versus CWV from AERONET sun photometers), it has not yet been assessed whether machine-learning approaches can further improve CWV. Using a novel spatiotemporal cross-validation approach to avoid overfitting, our XGBoost model with nine features derived from land use terms, date, and ancillary variables from the MAIAC retrieval, quantifies and can correct a substantial portion of measurement error relative to collocated measures at AERONET sites (27.8% and 15.5% decrease in Root Mean Square Error (RMSE) for Terra and Aqua datasets, respectively) in the Northeastern USA, 2000-2015. We use machine-learning interpretation tools to illustrate complex patterns of measurement error and describe a positive bias in MAIAC Terra CWV worsening in recent summertime conditions. We validate our predictive model on MAIAC CWV estimates at independent stations from the SuomiNet GPS network where our corrections decrease the RMSE by 20% and 10% for Terra and Aqua MAIAC CWV. Empirically correcting for measurement error with machine-learning algorithms is a post-processing opportunity to improve satellite-derived CWV data for Earth science and remote sensing applications.

# About the attachment #

1. The first zip file contains the data (collocated datasets) alone. The datasets in the zip file are JSON files that can be opened directly in a browser, a text editor, or R using functions like `jsonlite::fromJSON`.

2. The second zip file (CWV-project-repository) is an R project that contains all the code (/Code) and data (/Data) needed to reproduce the results.
The folder (/Intermediate) contains the intermediate cross-validation modeling results. If the R project is initiated using _cwv_paper.Rproj_, the Rmarkdown files (mainly _03_cwv_10by10cv_resultsmd.Rmd_) can reproduce all the results (figures and tables) used in the paper.
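
The description of this dataset refers to a spatiotemporal cross-validation approach used to avoid overfitting. The authors' exact scheme is not given here, so the following is only a generic sketch of one related idea, grouped cross-validation, in which all records from a monitoring station stay in the same fold so that a model is never tested on a site it was trained on. The function name `group_k_fold` and the station IDs are hypothetical.

```python
def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs where no group appears on both sides."""
    # Assign each distinct group (e.g. a station) to one of k folds.
    unique = sorted(set(groups))
    fold_of = {g: i % k for i, g in enumerate(unique)}

    for f in range(k):
        test = [i for i, g in enumerate(groups) if fold_of[g] == f]
        train = [i for i, g in enumerate(groups) if fold_of[g] != f]
        yield train, test


# Hypothetical station ID for each observation.
stations = ["A", "A", "B", "C", "C", "C", "D", "B"]
folds = list(group_k_fold(stations, k=2))
```

A temporal variant could group by year or season instead of by station; which grouping is appropriate depends on the kind of leakage one wants to rule out.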
Table_1_Five-Feature Model for Developing the Classifier for Synergistic vs....
frontiersin.figshare.com
xlsx
Updated Jun 13, 2023
Cite
Xiangjun Ji; Weida Tong; Zhichao Liu; Tieliu Shi (2023). Table_1_Five-Feature Model for Developing the Classifier for Synergistic vs. Antagonistic Drug Combinations Built by XGBoost.XLSX [Dataset]. http://doi.org/10.3389/fgene.2019.00600.s002
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fgene.2019.00600.s002
Dataset updated
Jun 13, 2023
Dataset provided by
Frontiers
Authors
Xiangjun Ji; Weida Tong; Zhichao Liu; Tieliu Shi
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Combinatorial drug therapy can improve the therapeutic effect and reduce the corresponding adverse events. In silico strategies to classify synergistic vs. antagonistic drug pairs are more efficient than experimental strategies. However, most of the developed methods have been applied only to cancer therapies. In this study, we introduce a novel method, XGBoost, based on five features of drugs and biomolecular networks of their targets, to classify synergistic vs. antagonistic drug combinations from different drug categories. We found that XGBoost outperformed other classifiers in both stratified fivefold cross-validation (CV) and independent validation. For example, XGBoost achieved higher predictive accuracy than other models (0.86, 0.78, 0.78, and 0.83 for XGBoost, logistic regression, naïve Bayesian, and random forest, respectively) for an independent validation set. We also found that the five-feature XGBoost model is much more effective at predicting combinatorial therapies that have synergistic effects than those with antagonistic effects. The five-feature XGBoost model was also validated on TCGA data with accuracy of 0.79 among the 61 tested drug pairs, which is comparable to that of DeepSynergy. Among the 14 main anatomical/pharmacological groups classified according to WHO Anatomic Therapeutic Class, for drugs belonging to five groups, their prediction accuracy was significantly increased (odds ratio < 1) or reduced (odds ratio > 1) (Fisher’s exact test, p < 0.05). This study concludes that our five-feature XGBoost model has significant benefits for classifying synergistic vs. antagonistic drug combinations.
Data from: Detection of illicit accounts over the Ethereum blockchain
test.dataverse.nl
dataverse.nl
csv, txt
Updated Feb 19, 2021
Cite
George Azzopardi; Steven Farrugia; Joshua Ellul (2021). Detection of illicit accounts over the Ethereum blockchain [Dataset]. http://doi.org/10.34894/GKAQYN
Available download formats: csv(1016388), txt(506)
Unique identifier
https://doi.org/10.34894/GKAQYN
Dataset updated
Feb 19, 2021
Dataset provided by
DataverseNL (test)
Authors
George Azzopardi; Steven Farrugia; Joshua Ellul
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
The recent technological advent of cryptocurrencies and their respective benefits have been shrouded with a number of illegal activities operating over the network such as money laundering, bribery, phishing, fraud, among others. In this work we focus on the Ethereum network, which has seen over 400 million transactions since its inception. Using 2179 accounts flagged by the Ethereum community for their illegal activity coupled with 2502 normal accounts, we seek to detect illicit accounts based on their transaction history using the XGBoost classifier. Using 10-fold cross-validation, XGBoost achieved an average accuracy of 0.963 (± 0.006) with an average AUC of 0.994 (± 0.0007). The top three features with the largest impact on the final model output were established to be ‘Time diff between first and last (Mins)’, ‘Total Ether balance’ and ‘Min value received’. Based on the results we conclude that the proposed approach is highly effective in detecting illicit accounts over the Ethereum network. Our contribution is multi-faceted; firstly, we propose an effective method to detect illicit accounts over the Ethereum network; secondly, we provide insights about the most important features; and thirdly, we publish the compiled data set as a benchmark for future related works.
Data_Sheet_1_Non-motor Clinical and Biomarker Predictors Enable High...
frontiersin.figshare.com
pdf
Updated Jun 1, 2023
Cite
Charles Leger; Monique Herbert; Joseph F. X. DeSouza (2023). Data_Sheet_1_Non-motor Clinical and Biomarker Predictors Enable High Cross-Validated Accuracy Detection of Early PD but Lesser Cross-Validated Accuracy Detection of Scans Without Evidence of Dopaminergic Deficit.PDF [Dataset]. http://doi.org/10.3389/fneur.2020.00364.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fneur.2020.00364.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Charles Leger; Monique Herbert; Joseph F. X. DeSouza
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Background: Early stage (preclinical) detection of Parkinson's disease (PD) remains challenged yet is crucial to both differentiate it from other disorders and facilitate timely administration of neuroprotective treatment as it becomes available.
Objective: In a cross-validation paradigm, this work focused on two binary predictive probability analyses: classification of early PD vs. controls and classification of early PD vs. SWEDD (scans without evidence of dopamine deficit). It was hypothesized that five distinct model types using combined non-motor and biomarker features would distinguish early PD from controls with > 80% cross-validated (CV) accuracy, but that the diverse nature of the SWEDD category would reduce early PD vs. SWEDD CV classification accuracy and alter model-based feature selection.
Methods: Cross-sectional, baseline data was acquired from the Parkinson's Progressive Markers Initiative (PPMI). Logistic regression, general additive (GAM), decision tree, random forest and XGBoost models were fitted using non-motor clinical and biomarker features. Randomized train and test data partitions were created. Model classification CV performance was compared using the area under the curve (AUC), sensitivity, specificity and the Kappa statistic.
Results: All five models achieved >0.80 AUC CV accuracy to distinguish early PD from controls. The GAM (CV AUC 0.928, sensitivity 0.898, specificity 0.897) and XGBoost (CV AUC 0.923, sensitivity 0.875, specificity 0.897) models were the top classifiers. Performance across all models was consistently lower in the early PD/SWEDD analyses, where the highest performing models were XGBoost (CV AUC 0.863, sensitivity 0.905, specificity 0.748) and random forest (CV AUC 0.822, sensitivity 0.809, specificity 0.721). XGBoost detection of non-PD SWEDD matched 1–2 years curated diagnoses in 81.25% (13/16) cases. In both early PD/control and early PD/SWEDD analyses, and across all models, hyposmia was the single most important feature to classification; rapid eye movement behavior disorder (questionnaire) was the next most commonly high ranked feature. Alpha-synuclein was a feature of import to early PD/control but not early PD/SWEDD classification and the Epworth Sleepiness scale was antithetically important to the latter but not former.
Interpretation: Non-motor clinical and biomarker variables enable high CV discrimination of early PD vs. controls but are less effective discriminating early PD from SWEDD.
The dataset used in this study.
plos.figshare.com
xls
Updated Feb 6, 2025
Cite
Jiyu Wang; Niaz Muhammad Shahani; Xigui Zheng; Jiang Hongwei; Xin Wei (2025). The dataset used in this study. [Dataset]. http://doi.org/10.1371/journal.pone.0314977.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0314977.t001
Dataset updated
Feb 6, 2025
Dataset provided by
PLOS ONE
Authors
Jiyu Wang; Niaz Muhammad Shahani; Xigui Zheng; Jiang Hongwei; Xin Wei
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Accurately evaluating earthquake-induced slope displacement is a key factor for designing slopes that can effectively respond to seismic activity. This study evaluates the capabilities of various machine learning models, including artificial neural network (ANN), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost) in analyzing earthquake-induced slope displacement. A dataset of 45 samples was used, with 70% allocated for training and 30% for testing. To improve model robustness, repeated 5-fold cross-validation was applied. Among the models, XGBoost demonstrated superior predictive accuracy, with an R² value of 0.99 on both the train and test data, outperforming ANN, SVM, and RF, which had R² values of 0.63 and 0.80, 0.87 and 0.86, 0.94 and 0.87 on the train and test data, respectively. Sensitivity analysis identified maximum horizontal acceleration (kmax = 0.714) as the most influential factor in slope displacement. The findings suggest that the XGBoost model developed in this study is highly effective in predicting earthquake-induced slope displacement, offering valuable insights for early warning systems and slope stability management.
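The description above compares models by their R-squared values on train and test data. The study's exact formula is not stated here, so the sketch below assumes the common coefficient-of-determination definition, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical displacement values, not taken from the dataset.
observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, [1.0, 2.0, 3.0, 4.0])
mean_only = r_squared(observed, [2.5, 2.5, 2.5, 2.5])
```

Under this definition a perfect prediction scores 1 and a model that always predicts the mean scores 0, which gives a sense of scale for the values quoted in the description.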
Data from: Machine Learning Models Identify New Inhibitors for Human OATP1B1...
figshare.com
zip
Updated Jun 2, 2023
Cite
Thomas R. Lane; Fabio Urbina; Xiaohong Zhang; Margret Fye; Jacob Gerlach; Stephen H. Wright; Sean Ekins (2023). Machine Learning Models Identify New Inhibitors for Human OATP1B1 [Dataset]. http://doi.org/10.1021/acs.molpharmaceut.2c00662.s002
Available download formats: zip
Unique identifier
https://doi.org/10.1021/acs.molpharmaceut.2c00662.s002
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Thomas R. Lane; Fabio Urbina; Xiaohong Zhang; Margret Fye; Jacob Gerlach; Stephen H. Wright; Sean Ekins
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
The uptake transporter OATP1B1 (SLC01B1) is largely localized to the sinusoidal membrane of hepatocytes and is a known victim of unwanted drug–drug interactions. Computational models are useful for identifying potential substrates and/or inhibitors of clinically relevant transporters. Our goal was to generate OATP1B1 in vitro inhibition data for [3H] estrone-3-sulfate (E3S) transport in CHO cells and use it to build machine learning models to facilitate a comparison of seven different classification models [Deep learning, Adaboosted decision trees, Bernoulli naïve bayes, k-nearest neighbors (knn), random forest, support vector classifier (SVC), logistic regression (lreg), and XGBoost (xgb)] using ECFP6 fingerprints to perform 5-fold, nested cross validation. In addition, we compared models using 3D pharmacophores, simple chemical descriptors alone or plus ECFP6, as well as ECFP4 and ECFP8 fingerprints. Several machine learning algorithms (SVC, lreg, xgb, and knn) had excellent nested cross validation statistics, particularly for accuracy, AUC, and specificity. An external test set containing 207 unique compounds not in the training set demonstrated that at every threshold SVC outperformed the other algorithms based on a rank normalized score. A prospective validation test set was chosen using prediction scores from the SVC models with ECFP fingerprints and was tested in vitro; 15 of 19 compounds (84% accuracy) predicted as active (≥20% inhibition) showed inhibition. Of these compounds, six (abamectin, asiaticoside, berbamine, doramectin, mobocertinib, and umbralisib) appear to be novel inhibitors of OATP1B1 not previously reported. These validated machine learning models can now be used to make predictions for drug–drug interactions for human OATP1B1 alongside other machine learning models for important drug transporters in our MegaTrans software.
Raw data.
figshare.com
bin
Updated Aug 11, 2023
Cite
Yuan Liu; Wenyi Du; Yi Guo; Zhiqiang Tian; Wei Shen (2023). Raw data. [Dataset]. http://doi.org/10.1371/journal.pone.0289621.s002
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0289621.s002
Dataset updated
Aug 11, 2023
Dataset provided by
PLOS ONE
Authors
Yuan Liu; Wenyi Du; Yi Guo; Zhiqiang Tian; Wei Shen
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Background: Colon cancer recurrence is a common adverse outcome for patients after complete mesocolic excision (CME) and greatly affects the near-term and long-term prognosis of patients. This study aimed to develop a machine learning model that can identify high-risk factors before, during, and after surgery, and predict the occurrence of postoperative colon cancer recurrence.
Methods: The study included 1187 patients with colon cancer, including 110 patients who had recurrent colon cancer. The researchers collected 44 characteristic variables, including patient demographic characteristics, basic medical history, preoperative examination information, type of surgery, and intraoperative information. Four machine learning algorithms, namely extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), and k-nearest neighbor algorithm (KNN), were used to construct the model. The researchers evaluated the model using the k-fold cross-validation method, ROC curve, calibration curve, decision curve analysis (DCA), and external validation.
Results: Among the four prediction models, the XGBoost algorithm performed the best. The ROC curve results showed that the AUC value of XGBoost was 0.962 in the training set and 0.952 in the validation set, indicating high prediction accuracy. The XGBoost model was stable during internal validation using the k-fold cross-validation method. The calibration curve demonstrated high predictive ability of the XGBoost model. The DCA curve showed that patients who received interventional treatment had a higher benefit rate under the XGBoost model. The external validation set’s AUC value was 0.91, indicating good extrapolation of the XGBoost prediction model.
Conclusion: The XGBoost machine learning algorithm-based prediction model for colon cancer recurrence has high prediction accuracy and clinical utility.
Table_7_Preliminary prediction of semen quality based on modifiable...
figshare.com
docx
Updated Jun 16, 2023
Cite
Mingjuan Zhou; Tianci Yao; Jian Li; Hui Hui; Weimin Fan; Yunfeng Guan; Aijun Zhang; Bufang Xu (2023). Table_7_Preliminary prediction of semen quality based on modifiable lifestyle factors by using the XGBoost algorithm.docx [Dataset]. http://doi.org/10.3389/fmed.2022.811890.s008
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fmed.2022.811890.s008
Dataset updated
Jun 16, 2023
Dataset provided by
Frontiers
Authors
Mingjuan Zhou; Tianci Yao; Jian Li; Hui Hui; Weimin Fan; Yunfeng Guan; Aijun Zhang; Bufang Xu
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Introduction: Semen quality has decreased gradually in recent years, and lifestyle changes are among the primary causes for this issue. Thus far, the specific lifestyle factors affecting semen quality remain to be elucidated.
Materials and methods: In this study, data on the following factors were collected from 5,109 men examined at our reproductive medicine center: 10 lifestyle factors that potentially affect semen quality (smoking status, alcohol consumption, staying up late, sleeplessness, consumption of pungent food, intensity of sports activity, sedentary lifestyle, working in hot conditions, sauna use in the last 3 months, and exposure to radioactivity); general factors including age, abstinence period, and season of semen examination; and comprehensive semen parameters [semen volume, sperm concentration, progressive and total sperm motility, sperm morphology, and DNA fragmentation index (DFI)]. Then, machine learning with the XGBoost algorithm was applied to establish a primary prediction model by using the collected data. Furthermore, the accuracy of the model was verified via multiple logistic regression following k-fold cross-validation analyses.
Results: The results indicated that for semen volume, sperm concentration, progressive and total sperm motility, and DFI, the area under the curve (AUC) values ranged from 0.648 to 0.697, while the AUC for sperm morphology was only 0.506. Among the 13 factors, smoking status was the major factor affecting semen volume, sperm concentration, and progressive and total sperm motility. Age was the most important factor affecting DFI. Logistic combined with cross-validation analysis revealed similar results. Furthermore, it showed that heavy smoking (>20 cigarettes/day) had an overall negative effect on semen volume and sperm concentration and progressive and total sperm motility (OR = 4.69, 6.97, 11.16, and 10.35, respectively), while age of >35 years was associated with increased DFI (OR = 5.47).
Conclusion: The preliminary lifestyle-based model developed for semen quality prediction by using the XGBoost algorithm showed potential for clinical application and further optimization with larger training datasets.
Parameter Values of the models.
plos.figshare.com
xls
Updated Jun 25, 2025
Cite
Jin Wang; Shihan Ma; Qing Lv; Qiang Li (2025). Parameter Values of the models. [Dataset]. http://doi.org/10.1371/journal.pone.0320298.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0320298.t002
Dataset updated
Jun 25, 2025
Dataset provided by
PLOS ONE
Authors
Jin Wang; Shihan Ma; Qing Lv; Qiang Li
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Population prediction can provide effective data support for social and economic planning and decision-making, especially when sub-national populations are forecast accurately. In addition to realizing efficient smart population management, this research focuses primarily on a combination model for forecasting demographic data based on machine learning. To address the higher error of population forecasts caused by high population density and mobility, a dynamic monitoring method based on mobile communication big data, such as mobile phone signals, is proposed; combined with more structurally stable traditional statistical data, it forms a multi-source dataset that is both accurate and real-time. In the study, the Extreme Gradient Boosting tree (XGBoost) model is used to identify the base model to create a reliable predictive model for population dynamic monitoring. The sparrow search algorithm (SSA) is investigated to obtain more reasonable parameters for XGBoost and improve forecast accuracy. The combination model is verified on data from the 6th and 7th national population censuses and mobile phone signal data in Hebei Province, yielding predicted data for mortality and migration, categorized by age and gender, for the following year. Subsequently, the research compared the performance of different metaheuristic algorithms and various gradient-boosting machine-learning models on the dataset. The SSA-XGBoost model demonstrates better prediction performance in the demographic data forecast, with a better R² of 0.9984, a lower mean absolute error of 0.0002, and a mean squared error of 6.9184. The results of the comparative experiments and cross-validation show that the proposed predictive model can effectively forecast the demographic data for sub-national regions to realize smart population management.
Average feature ranks in the LASSO and xgboost models.
plos.figshare.com
xlsx
Updated Jun 20, 2024
Cite
Shaked Bergman; Tamir Tuller (2024). Average feature ranks in the LASSO and xgboost models. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012214.s006
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pcbi.1012214.s006
Dataset updated
Jun 20, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Shaked Bergman; Tamir Tuller
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The average rank of the 3D feature’s importance value over the 1000 cross-validation iterations described in the paper. A lower rank denotes a more informative feature. (XLSX)
Table_3_T4SE-XGB: Interpretable Sequence-Based Prediction of Type IV...
frontiersin.figshare.com
txt
Updated Jun 1, 2023
Cite
Tianhang Chen; Xiangeng Wang; Yanyi Chu; Yanjing Wang; Mingming Jiang; Dong-Qing Wei; Yi Xiong (2023). Table_3_T4SE-XGB: Interpretable Sequence-Based Prediction of Type IV Secreted Effectors Using eXtreme Gradient Boosting Algorithm.csv [Dataset]. http://doi.org/10.3389/fmicb.2020.580382.s002
Available download formats: txt
Unique identifier
https://doi.org/10.3389/fmicb.2020.580382.s002
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Tianhang Chen; Xiangeng Wang; Yanyi Chu; Yanjing Wang; Mingming Jiang; Dong-Qing Wei; Yi Xiong
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Type IV secreted effectors (T4SEs) can be translocated into the cytosol of host cells via type IV secretion system (T4SS) and cause diseases. However, experimental approaches to identify T4SEs are time- and resource-consuming, and the existing computational tools based on machine learning techniques have some obvious limitations such as the lack of interpretability in the prediction models. In this study, we proposed a new model, T4SE-XGB, which uses the eXtreme gradient boosting (XGBoost) algorithm for accurate identification of type IV effectors based on optimal features based on protein sequences. After trying 20 different types of features, the best performance was achieved when all features were fed into XGBoost by the 5-fold cross validation in comparison with other machine learning methods. Then, the ReliefF algorithm was adopted to get the optimal feature set on our dataset, which further improved the model performance. T4SE-XGB exhibited highest predictive performance on the independent test set and outperformed other published prediction tools. Furthermore, the SHAP method was used to interpret the contribution of features to model predictions. The identification of key features can contribute to improved understanding of multifactorial contributors to host-pathogen interactions and bacterial pathogenesis. In addition to type IV effector prediction, we believe that the proposed framework can provide instructive guidance for similar studies to construct prediction methods on related biological problems. The data and source code of this study can be freely accessed at https://github.com/CT001002/T4SE-XGB.
The five cross-validation stages involved in the present study.
plos.figshare.com
xls
Updated Jun 5, 2023
Cite
Lifeng Wu; Junliang Fan (2023). The five cross-validation stages involved in the present study. [Dataset]. http://doi.org/10.1371/journal.pone.0217520.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0217520.t002
Dataset updated
Jun 5, 2023
Dataset provided by
PLOS ONE
Authors
Lifeng Wu; Junliang Fan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The five cross-validation stages involved in the present study.
Data_Sheet_1_Machine learning predicts the prognosis of breast cancer...
frontiersin.figshare.com
txt
Updated Jun 16, 2023
Cite
Chaofan Li; Mengjie Liu; Jia Li; Weiwei Wang; Cong Feng; Yifan Cai; Fei Wu; Xixi Zhao; Chong Du; Yinbin Zhang; Yusheng Wang; Shuqun Zhang; Jingkun Qu (2023). Data_Sheet_1_Machine learning predicts the prognosis of breast cancer patients with initial bone metastases.CSV [Dataset]. http://doi.org/10.3389/fpubh.2022.1003976.s001
Available download formats: txt
Unique identifier
https://doi.org/10.3389/fpubh.2022.1003976.s001
Dataset updated
Jun 16, 2023
Dataset provided by
Frontiers
Authors
Chaofan Li; Mengjie Liu; Jia Li; Weiwei Wang; Cong Feng; Yifan Cai; Fei Wu; Xixi Zhao; Chong Du; Yinbin Zhang; Yusheng Wang; Shuqun Zhang; Jingkun Qu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Bone is the most common metastatic site in patients with advanced breast cancer, and survival time is their primary concern; however, we lack accurate predictive models in clinical practice. In addition, primary surgery for breast cancer patients with bone metastases is still controversial.
Method: The data used for analysis in this study were obtained from the SEER database (2010–2019). We performed a Cox regression analysis to identify prognostic factors in patients with bone metastatic breast cancer (BMBC). Through cross-validation, we constructed an XGBoost model to predict survival in patients with BMBC. We also investigated the prognosis of patients treated with neoadjuvant chemotherapy plus surgery and with chemotherapy alone using propensity score matching and K–M survival analysis.
Results: Our validation results showed that the model has high sensitivity, specificity, and correctness, and that it was the most accurate in predicting the survival of patients with BMBC (1-year AUC = 0.818, 3-year AUC = 0.798, and 5-year survival AUC = 0.791). The sensitivity of the 1-year model was higher (0.79), while the specificity of the 5-year model was higher (0.86). Interestingly, we found that if the time from diagnosis to therapy was ≥1 month, patients with BMBC had even better survival than those who started treatment immediately (HR = 0.920, 95%CI 0.869–0.974, P < 0.01). BMBC patients with an income of more than USD$70,000 had better OS (HR = 0.814, 95%CI 0.745–0.890, P < 0.001) and BCSS (HR = 0.808, 95%CI 0.735–0.889, P < 0.001) than those with an income of < USD$50,000. We also found that, compared with chemotherapy alone, neoadjuvant chemotherapy plus surgical treatment significantly improved OS and BCSS in all molecular subtypes of patients with BMBC, while only patients with bone metastases only, bone and liver metastases, or bone and lung metastases benefited from neoadjuvant chemotherapy plus surgical treatment.
Conclusion: We constructed an AI model to provide a quantitative method for predicting the survival of patients with BMBC, and our validation results indicate that this model should be highly reproducible in a similar patient population. We also identified potential prognostic factors for patients with BMBC and suggested that primary surgery followed by neoadjuvant chemotherapy might increase survival in a selected subgroup of patients.
Data_Sheet_1_Analysis of hematological indicators via explainable artificial...
frontiersin.figshare.com
xlsx
Updated Apr 2, 2024
Cite
Rustem Yilmaz; Fatma Hilal Yagin; Cemil Colak; Kenan Toprak; Nagwan Abdel Samee; Noha F. Mahmoud; Amnah Ali Alshahrani (2024). Data_Sheet_1_Analysis of hematological indicators via explainable artificial intelligence in the diagnosis of acute heart failure: a retrospective study.xlsx [Dataset]. http://doi.org/10.3389/fmed.2024.1285067.s001
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fmed.2024.1285067.s001
Dataset updated
Apr 2, 2024
Dataset provided by
Frontiers
Authors
Rustem Yilmaz; Fatma Hilal Yagin; Cemil Colak; Kenan Toprak; Nagwan Abdel Samee; Noha F. Mahmoud; Amnah Ali Alshahrani
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: Acute heart failure (AHF) is a serious medical problem that necessitates hospitalization and often results in death. Patients hospitalized in the emergency department (ED) should therefore receive an immediate diagnosis and treatment. Unfortunately, there is not yet a fast and accurate laboratory test for identifying AHF. The purpose of this research is to apply the principles of explainable artificial intelligence (XAI) to the analysis of hematological indicators for the diagnosis of AHF.
Methods: In this retrospective analysis, 425 patients with AHF and 430 healthy individuals were assessed. Patients’ demographic and hematological information was analyzed to diagnose AHF. Important risk variables for AHF diagnosis were identified using Least Absolute Shrinkage and Selection Operator (LASSO) feature selection. To test the efficacy of the suggested prediction model, Extreme Gradient Boosting (XGBoost), a 10-fold cross-validation procedure was implemented. The area under the receiver operating characteristic curve (AUC), F1 score, Brier score, Positive Predictive Value (PPV), and Negative Predictive Value (NPV) were all computed to evaluate the model’s efficacy. Permutation-based analysis and SHAP were used to assess the importance and influence of the model’s incorporated risk factors.
Results: White blood cell (WBC), monocytes, neutrophils, neutrophil-lymphocyte ratio (NLR), red cell distribution width-standard deviation (RDW-SD), RDW-coefficient of variation (RDW-CV), and platelet distribution width (PDW) values were significantly higher than in the healthy group (p
Table_2_A Machine Learning Model to Predict Risperidone Active Moiety...
frontiersin.figshare.com
docx
Updated Jun 8, 2023
Cite
Wei Guo; Ze Yu; Ya Gao; Xiaoqian Lan; Yannan Zang; Peng Yu; Zeyuan Wang; Wenzhuo Sun; Xin Hao; Fei Gao (2023). Table_2_A Machine Learning Model to Predict Risperidone Active Moiety Concentration Based on Initial Therapeutic Drug Monitoring.DOCX [Dataset]. http://doi.org/10.3389/fpsyt.2021.711868.s002
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpsyt.2021.711868.s002
Dataset updated
Jun 8, 2023
Dataset provided by
Frontiers
Authors
Wei Guo; Ze Yu; Ya Gao; Xiaoqian Lan; Yannan Zang; Peng Yu; Zeyuan Wang; Wenzhuo Sun; Xin Hao; Fei Gao
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Risperidone is an efficacious second-generation antipsychotic (SGA) used to treat a wide spectrum of psychiatric diseases, whereas its active moiety (risperidone and 9-hydroxyrisperidone) concentration without a therapeutic reference range may increase the risk of adverse drug reactions. We aimed to establish a prediction model of risperidone active moiety concentration in the next therapeutic drug monitoring (TDM) based on the initial TDM information using machine learning methods. Data from a total of 983 patients treated with risperidone between May 2017 and May 2018 at Beijing Anding Hospital were collected as the data set. Sixteen predictors (the initial TDM value, dosage, age, WBC, PLT, BUN, weight, BMI, prolactin, ALT, MECT, Cr, AST, Ccr, TDM interval, and RBC) were screened from 26 variables through univariate analysis (p < 0.05) and XGBoost (importance score > 0). Ten algorithms (XGBoost, LightGBM, CatBoost, AdaBoost, Random Forest, support vector machine, lasso regression, ridge regression, linear regression, and k-nearest neighbor) were compared on model performance, and ultimately, XGBoost was chosen to establish the prediction model. A cohort of 210 patients treated with risperidone between March 1, 2019, and May 31, 2019, at Beijing Anding Hospital was used to validate the model. Finally, the prediction model was evaluated, obtaining R² (0.512 in test cohort; 0.374 in validation cohort), MAE (10.97 in test cohort; 12.07 in validation cohort), MSE (198.55 in test cohort; 324.15 in validation cohort), RMSE (14.09 in test cohort; 18.00 in validation cohort), and accuracy of the predicted TDM within ±30% of the actual TDM (54.82% in test cohort; 60.95% in validation cohort). The prediction model shows promising performance for facilitating rational, individualized risperidone regimens and provides a reference for risk prediction with other antipsychotic drugs.
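As a rough illustration of how the metrics quoted in this description relate to one another, the sketch below computes MAE, MSE, RMSE, and the share of predictions within ±30% of the observed value. It is not the authors' code: the function name `regression_metrics`, the `tolerance` parameter, and the sample values are all hypothetical. RMSE is the square root of MSE, which is consistent with the reported pairs (198.55 and 14.09; 324.15 and 18.00).

```python
import math

def regression_metrics(actual, predicted, tolerance=0.30):
    """Return MAE, MSE, RMSE and the fraction of predictions that fall
    within +/- tolerance (relative) of the actual value."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # Count predictions whose absolute error is at most 30% of the actual value.
    within = sum(1 for a, e in zip(actual, errors) if abs(e) <= tolerance * abs(a))
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "within_tolerance": within / n,
    }

# Hypothetical concentrations, for illustration only (not from the dataset).
actual = [20.0, 35.0, 50.0, 10.0]
predicted = [24.0, 30.0, 70.0, 12.0]
metrics = regression_metrics(actual, predicted)
```

In this toy example three of the four predictions fall inside the ±30% band, so `within_tolerance` plays the role of the "accuracy" figure quoted in the description.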
Descriptive statistics of data set.
plos.figshare.com
xls
Updated Jun 10, 2024
Cite
Davood Fereidooni; Zohre Karimi; Fatemeh Ghasemi (2024). Descriptive statistics of data set. [Dataset]. http://doi.org/10.1371/journal.pone.0302944.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302944.t002
Dataset updated
Jun 10, 2024
Dataset provided by
PLOS ONE
Authors
Davood Fereidooni; Zohre Karimi; Fatemeh Ghasemi
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The uniaxial compressive strength (UCS) and elasticity modulus (E) of intact rock are two fundamental requirements in engineering applications. These parameters can be measured either directly from the uniaxial compressive strength test or indirectly by using soft computing predictive models. In the present research, the UCS and E of intact carbonate rocks have been predicted from the results of simple non-destructive laboratory tests by introducing two stacking ensemble learning models. For this purpose, dry unit weight, porosity, P-wave velocity, Brinell surface hardness, UCS, and static E were measured for 70 carbonate rock samples. Then, two stacking ensemble learning models were developed for estimating the UCS and E of the rocks. The applied stacking ensemble learning method integrates the advantages of two base models in the first level, where the base models are multi-layer perceptron (MLP) and random forest (RF) for predicting UCS, and support vector regressor (SVR) and extreme gradient boosting (XGBoost) for predicting E. Grid search combined with k-fold cross-validation is applied to tune the parameters of both the base models and the meta-learner. The results demonstrate the generalization ability of the stacking ensemble method compared with the base models in terms of common performance measures. The values of the coefficient of determination (R²) obtained from the stacking ensemble are 0.909 and 0.831 for predicting UCS and E, respectively. Similarly, the stacking ensemble yielded Root Mean Squared Error (RMSE) values of 1.967 and 0.621 for the prediction of UCS and E, respectively. Accordingly, the proposed models outperform SVR and MLP as single models and RF and XGBoost as two representative ensemble models. Furthermore, sensitivity analysis is carried out to investigate the impact of the input parameters.
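The description mentions k-fold cross-validation on 70 samples but does not state k or how the samples were shuffled. A minimal sketch of the splitting step, assuming k = 5 and a fixed shuffle seed (both assumptions, as is the helper name `k_fold_indices`), might look like this:

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# 70 samples, as in the description; k = 5 is an assumed value.
folds = list(k_fold_indices(70, k=5))
```

In a grid search, each candidate parameter setting would be fitted on the training indices of every fold and scored on the matching validation indices, and the setting with the best mean score would be kept.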
Table_4_T4SE-XGB: Interpretable Sequence-Based Prediction of Type IV...
frontiersin.figshare.com
txt
Updated Jun 2, 2023
Cite
Tianhang Chen; Xiangeng Wang; Yanyi Chu; Yanjing Wang; Mingming Jiang; Dong-Qing Wei; Yi Xiong (2023). Table_4_T4SE-XGB: Interpretable Sequence-Based Prediction of Type IV Secreted Effectors Using eXtreme Gradient Boosting Algorithm.csv [Dataset]. http://doi.org/10.3389/fmicb.2020.580382.s003
Available download formats: txt
Unique identifier
https://doi.org/10.3389/fmicb.2020.580382.s003
Dataset updated
Jun 2, 2023
Dataset provided by
Frontiers
Authors
Tianhang Chen; Xiangeng Wang; Yanyi Chu; Yanjing Wang; Mingming Jiang; Dong-Qing Wei; Yi Xiong
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Type IV secreted effectors (T4SEs) can be translocated into the cytosol of host cells via the type IV secretion system (T4SS) and cause disease. However, experimental approaches to identifying T4SEs are time- and resource-consuming, and existing computational tools based on machine learning have some obvious limitations, such as the lack of interpretability of the prediction models. In this study, we proposed a new model, T4SE-XGB, which uses the eXtreme gradient boosting (XGBoost) algorithm to accurately identify type IV effectors from optimal features derived from protein sequences. After trying 20 different types of features, the best performance was achieved when all features were fed into XGBoost, as assessed by 5-fold cross-validation in comparison with other machine learning methods. Then, the ReliefF algorithm was adopted to obtain the optimal feature set on our dataset, which further improved model performance. T4SE-XGB exhibited the highest predictive performance on the independent test set and outperformed other published prediction tools. Furthermore, the SHAP method was used to interpret the contribution of features to model predictions. The identification of key features can contribute to an improved understanding of the multifactorial contributors to host-pathogen interactions and bacterial pathogenesis. In addition to type IV effector prediction, we believe that the proposed framework can provide instructive guidance for similar studies that construct prediction methods for related biological problems. The data and source code of this study can be freely accessed at https://github.com/CT001002/T4SE-XGB.
