20 datasets found
  1. MoA Weight: XGBoost

    • kaggle.com
    Updated Dec 4, 2020
    Cite
    Tawara (2020). MoA Weight: XGBoost [Dataset]. https://www.kaggle.com/datasets/ttahara/moa-weight-xgb-seed-cv/suggestions
    Dataset updated
    Dec 4, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tawara
    Description

    Dataset

    This dataset was created by Tawara

    Released under CC0: Public Domain


  2. Data_Sheet_1_Non-motor Clinical and Biomarker Predictors Enable High...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    + more versions
    Cite
    Charles Leger; Monique Herbert; Joseph F. X. DeSouza (2023). Data_Sheet_1_Non-motor Clinical and Biomarker Predictors Enable High Cross-Validated Accuracy Detection of Early PD but Lesser Cross-Validated Accuracy Detection of Scans Without Evidence of Dopaminergic Deficit.PDF [Dataset]. http://doi.org/10.3389/fneur.2020.00364.s001
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Charles Leger; Monique Herbert; Joseph F. X. DeSouza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Early stage (preclinical) detection of Parkinson's disease (PD) remains challenging yet is crucial to both differentiate it from other disorders and facilitate timely administration of neuroprotective treatment as it becomes available.
    Objective: In a cross-validation paradigm, this work focused on two binary predictive probability analyses: classification of early PD vs. controls and classification of early PD vs. SWEDD (scans without evidence of dopamine deficit). It was hypothesized that five distinct model types using combined non-motor and biomarker features would distinguish early PD from controls with > 80% cross-validated (CV) accuracy, but that the diverse nature of the SWEDD category would reduce early PD vs. SWEDD CV classification accuracy and alter model-based feature selection.
    Methods: Cross-sectional, baseline data was acquired from the Parkinson's Progression Markers Initiative (PPMI). Logistic regression, generalized additive (GAM), decision tree, random forest and XGBoost models were fitted using non-motor clinical and biomarker features. Randomized train and test data partitions were created. Model classification CV performance was compared using the area under the curve (AUC), sensitivity, specificity and the Kappa statistic.
    Results: All five models achieved >0.80 AUC CV accuracy to distinguish early PD from controls. The GAM (CV AUC 0.928, sensitivity 0.898, specificity 0.897) and XGBoost (CV AUC 0.923, sensitivity 0.875, specificity 0.897) models were the top classifiers. Performance across all models was consistently lower in the early PD/SWEDD analyses, where the highest performing models were XGBoost (CV AUC 0.863, sensitivity 0.905, specificity 0.748) and random forest (CV AUC 0.822, sensitivity 0.809, specificity 0.721). XGBoost detection of non-PD SWEDD matched 1–2 years curated diagnoses in 81.25% (13/16) of cases. In both early PD/control and early PD/SWEDD analyses, and across all models, hyposmia was the single most important feature to classification; rapid eye movement behavior disorder (questionnaire) was the next most commonly high ranked feature. Alpha-synuclein was a feature of import to early PD/control but not early PD/SWEDD classification and the Epworth Sleepiness scale was antithetically important to the latter but not former.
    Interpretation: Non-motor clinical and biomarker variables enable high CV discrimination of early PD vs. controls but are less effective discriminating early PD from SWEDD.
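The abstract above compares models by sensitivity, specificity and the Kappa statistic. As a minimal sketch of how those three numbers follow from a binary confusion matrix, assuming hypothetical counts (the function name `binary_metrics` and the counts are illustrative and are not taken from the study):

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and Cohen's kappa from confusion-matrix counts."""
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # share of true cases that were detected
    specificity = tn / (tn + fp)   # share of true non-cases that were rejected
    observed = (tp + tn) / n       # observed agreement (plain accuracy)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return {"sensitivity": sensitivity, "specificity": specificity, "kappa": kappa}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=45, fp=5, tn=45, fn=5)
print(metrics)
```

Kappa discounts the agreement a classifier would reach by chance, which is why it can sit well below accuracy when classes are imbalanced.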

  3. Data from: Multi-Sensor Integration and Machine Learning for High-Resolution...

    • agdatacommons.nal.usda.gov
    xlsx
    Updated May 16, 2025
    Cite
    Iddy Muzzo; Kelvyn Bladen; Andres Perea; Shelemia Nyamuryekung'e; Juan J. Villalba (2025). Multi-Sensor Integration and Machine Learning for High-Resolution Classification of Herbivore Foraging Behavior [Dataset]. http://doi.org/10.15482/USDA.ADC/28507400.v1
    Available download formats: xlsx
    Dataset updated
    May 16, 2025
    Dataset provided by
    Ag Data Commons
    Authors
    Iddy Muzzo; Kelvyn Bladen; Andres Perea; Shelemia Nyamuryekung'e; Juan J. Villalba
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The study used Random Test-Split (RTS) and Cross-Validation (CV) machine learning methods to test different models for classifying cattle foraging behavior states, foraging activities, posture, and activity by posture, using GPS-coupled accelerometer data, with continuous recorded observation (12 hours per day) as supporting ground truth. XGBoost with RTS performed best for general activity state classification, while Random Forest with CV excelled in the more detailed foraging-activity and activity-by-posture classifications. Key movement indicators such as speed, Actindex, and the sensor values (x, y, and z) were vital in predicting behaviors, suggesting specific sensors for tracking behaviors of interest to ranchers. The results highlight the benefits of continuous monitoring and advanced data analysis for real-time livestock tracking, leading to better grazing management, improved animal welfare, and more sustainable land use.
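The description contrasts a single random test split with cross-validation. The difference can be sketched in a few lines of standard-library Python; the function names, the 20% test fraction and the choice of five folds are assumptions made for illustration, not details reported by the study:

```python
import random


def random_test_split(n_samples, test_fraction=0.2, seed=0):
    """One shuffled partition: each sample lands in train or test exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_samples * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold_splits(n_samples, k=5, seed=0):
    """k partitions: every sample is held out for testing in exactly one fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, test


train, test = random_test_split(100)
splits = list(k_fold_splits(100, k=5))
print(len(train), len(test), len(splits))
```

A single split yields one performance estimate that depends on which samples happened to be held out; cross-validation averages over k such estimates at k times the training cost.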

  4. Data_Sheet_2_Non-motor Clinical and Biomarker Predictors Enable High...

    • frontiersin.figshare.com
    Updated Jun 1, 2023
    + more versions
    Cite
    Charles Leger; Monique Herbert; Joseph F. X. DeSouza (2023). Data_Sheet_2_Non-motor Clinical and Biomarker Predictors Enable High Cross-Validated Accuracy Detection of Early PD but Lesser Cross-Validated Accuracy Detection of Scans Without Evidence of Dopaminergic Deficit.ZIP [Dataset]. http://doi.org/10.3389/fneur.2020.00364.s003
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Charles Leger; Monique Herbert; Joseph F. X. DeSouza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Early stage (preclinical) detection of Parkinson's disease (PD) remains challenging yet is crucial to both differentiate it from other disorders and facilitate timely administration of neuroprotective treatment as it becomes available.
    Objective: In a cross-validation paradigm, this work focused on two binary predictive probability analyses: classification of early PD vs. controls and classification of early PD vs. SWEDD (scans without evidence of dopamine deficit). It was hypothesized that five distinct model types using combined non-motor and biomarker features would distinguish early PD from controls with > 80% cross-validated (CV) accuracy, but that the diverse nature of the SWEDD category would reduce early PD vs. SWEDD CV classification accuracy and alter model-based feature selection.
    Methods: Cross-sectional, baseline data was acquired from the Parkinson's Progression Markers Initiative (PPMI). Logistic regression, generalized additive (GAM), decision tree, random forest and XGBoost models were fitted using non-motor clinical and biomarker features. Randomized train and test data partitions were created. Model classification CV performance was compared using the area under the curve (AUC), sensitivity, specificity and the Kappa statistic.
    Results: All five models achieved >0.80 AUC CV accuracy to distinguish early PD from controls. The GAM (CV AUC 0.928, sensitivity 0.898, specificity 0.897) and XGBoost (CV AUC 0.923, sensitivity 0.875, specificity 0.897) models were the top classifiers. Performance across all models was consistently lower in the early PD/SWEDD analyses, where the highest performing models were XGBoost (CV AUC 0.863, sensitivity 0.905, specificity 0.748) and random forest (CV AUC 0.822, sensitivity 0.809, specificity 0.721). XGBoost detection of non-PD SWEDD matched 1–2 years curated diagnoses in 81.25% (13/16) of cases. In both early PD/control and early PD/SWEDD analyses, and across all models, hyposmia was the single most important feature to classification; rapid eye movement behavior disorder (questionnaire) was the next most commonly high ranked feature. Alpha-synuclein was a feature of import to early PD/control but not early PD/SWEDD classification and the Epworth Sleepiness scale was antithetically important to the latter but not former.
    Interpretation: Non-motor clinical and biomarker variables enable high CV discrimination of early PD vs. controls but are less effective discriminating early PD from SWEDD.

  5. Table_1_Five-Feature Model for Developing the Classifier for Synergistic vs....

    • frontiersin.figshare.com
    xlsx
    Updated Jun 13, 2023
    + more versions
    Cite
    Xiangjun Ji; Weida Tong; Zhichao Liu; Tieliu Shi (2023). Table_1_Five-Feature Model for Developing the Classifier for Synergistic vs. Antagonistic Drug Combinations Built by XGBoost.XLSX [Dataset]. http://doi.org/10.3389/fgene.2019.00600.s002
    Available download formats: xlsx
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Xiangjun Ji; Weida Tong; Zhichao Liu; Tieliu Shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Combinatorial drug therapy can improve the therapeutic effect and reduce the corresponding adverse events. In silico strategies to classify synergistic vs. antagonistic drug pairs are more efficient than experimental strategies. However, most of the developed methods have been applied only to cancer therapies. In this study, we introduce a novel method, XGBoost, based on five features of drugs and biomolecular networks of their targets, to classify synergistic vs. antagonistic drug combinations from different drug categories. We found that XGBoost outperformed other classifiers in both stratified fivefold cross-validation (CV) and independent validation. For example, XGBoost achieved higher predictive accuracy than other models (0.86, 0.78, 0.78, and 0.83 for XGBoost, logistic regression, naïve Bayesian, and random forest, respectively) for an independent validation set. We also found that the five-feature XGBoost model is much more effective at predicting combinatorial therapies that have synergistic effects than those with antagonistic effects. The five-feature XGBoost model was also validated on TCGA data with accuracy of 0.79 among the 61 tested drug pairs, which is comparable to that of DeepSynergy. Among the 14 main anatomical/pharmacological groups classified according to WHO Anatomic Therapeutic Class, for drugs belonging to five groups, their prediction accuracy was significantly increased (odds ratio < 1) or reduced (odds ratio > 1) (Fisher’s exact test, p < 0.05). This study concludes that our five-feature XGBoost model has significant benefits for classifying synergistic vs. antagonistic drug combinations.
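Stratified fivefold cross-validation, as mentioned above, keeps the class proportions roughly equal in every fold. A minimal sketch of that idea, assuming a hypothetical label list (the 30/20 class counts and the function name `stratified_k_fold` are illustrative and do not come from the study):

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Assign sample indices to k folds so each fold mirrors the class mix."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class out round-robin, like cards, across the folds.
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds


# Hypothetical labels: 30 synergistic and 20 antagonistic drug pairs.
labels = ["synergistic"] * 30 + ["antagonistic"] * 20
folds = stratified_k_fold(labels, k=5)
for fold in folds:
    print(sum(labels[i] == "synergistic" for i in fold),
          sum(labels[i] == "antagonistic" for i in fold))
```

Each fold then serves once as the held-out set while the other four are used for training.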

  6. Data from: Trophic reorganization of animal communities under climate change...

    • data.niaid.nih.gov
    • dataone.org
    • +2more
    zip
    Updated Aug 26, 2024
    Cite
    Manuel Mendoza; Miguel B. Araujo (2024). Trophic reorganization of animal communities under climate change [Dataset]. http://doi.org/10.5061/dryad.dbrv15f83
    Available download formats: zip
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    Consejo Superior de Investigaciones Científicas
    Authors
    Manuel Mendoza; Miguel B. Araujo
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Aim: This study uses a novel modeling approach to understand global trophic structure transformations under 21st-century climate changes. The goal is to project and understand the impacts of climate change on trophic dynamics, guiding future research and conservation efforts.
    Location: 14,520 terrestrial grid cells of 1° x 1° globally.
    Taxon: Trophic structures were assessed for 15,265 species, including 9,993 non-marine birds and 5,272 terrestrial mammals, across 9 predefined trophic guilds.
    Methods: A spatially explicit community trophic structure model, based on an extreme gradient boosting algorithm (Xgboost), was used. The model was trained with 1961-1990 climatic data and projected changes according to three Shared Socioeconomic Pathways: SSP2-45, SSP3-70, and SSP5-85.
    Results: The Xgboost model showed high predictive accuracy (86%, kappa=0.91). Projections indicated many global regions are transitioning in their trophic structures due to climate changes from 1990 to 2018, with decreases in species carrying capacity in 5.5% of cells and increases in 9.8%. Predictions for mid- and late-21st century under climate scenarios suggest significant reorganization, with notable impacts in regions such as the Amazon Basin, Central Africa, and Southeast Asia. Under SSP5-85, 17.1% of cells may face reductions in carrying capacity, while 41.1% could see increases, affecting thousands of species.
    Main conclusions: Climate change is profoundly reorganizing global trophic communities, with significant shifts in species carrying capacity across different guilds. Tropical regions and high northern latitudes are most affected, with some species facing collapses and others finding new opportunities. These changes highlight the need to integrate community trophic structure models into biodiversity conservation strategies, offering a comprehensive view of climate change impacts on trophic networks.
Methods Data Collection Species Distribution Data Geographical data were garnered from two primary sources and subsequently plotted on a global terrestrial grid, with each cell measuring 1 × 1°. These sources included the global distribution ranges of terrestrial mammals and non-marine birds. The distributions of species, specifically 9,993 non-marine birds and 5,272 terrestrial mammals, totaling 15,265 species, were informed by the IUCN Global Assessment's data on native ranges (IUCN, 2014). To enable analysis, a presence/absence matrix was created. In this matrix, the species were aligned as columns, each named, against 14,498 terrestrial grid cells, each cell measuring 1 × 1°, as rows. These include all the non-coastal cells of the world, excluding Antarctica and some northern regions, such as most of Greenland, for which some data are lacking. This approach provided a clear, granular view of species distribution across the globe. Bioclimatic Variables The bioclimatic variables were divided into two datasets: historical (1961-2018) and future (2021-2100). Historical bioclimatic variables were not obtained directly but derived from three monthly meteorological variables: mean minimum temperature (°C), mean maximum temperature (°C), and total precipitation (mm). These variables were downscaled from CRU-TS-4.03 (Harris et al., 2014) with WorldClim 2.1 (Fick & Hijmans, 2017) for bias correction. The nineteen WorldClim variables were calculated from these three monthly meteorological variables using the "biovars" function of the R dismo package (Hijmans et al., 2011). Unlike the historical data, pre-processed bioclimatic variables for the future could be accessed directly. We used a multimodel ensemble approach, which tends to perform better than any individual model (Pierce et al., 2009; Araújo & New, 2007). 
The ensemble integrates mean outputs from 25 global climate models (GCMs) corresponding to an array of twelve different future climate change scenarios (Harris et al., 2014; Fick & Hijmans, 2017). These scenarios emerge from the interplay of four specific timeframes (2021-2040, 2041-2060, 2061-2080, and 2081-2100) and three Shared Socio-economic Pathways (ssp2-45, ssp3-70, and ssp5-85) (Gidden et al., 2019). Feeding Habits Data The feeding habits of bird and mammal species were obtained from the global species-level compilation of key trophic attributes, known as Elton traits 1.0 (Wilman et al., 2014). This dataset provided essential information on the trophic roles of species, which is crucial for understanding their ecological interactions and energy flow within ecosystems. Trophic profile of the cells and structure identification Trophic profile of the cells We assigned each of the 15,265 terrestrial mammal and non-marine bird species to one of 9 trophic guilds and then counted the number of species in each guild within each cell, following a previous analysis (Mendoza & Araújo, 2022). The result is a matrix with the 9 trophic guilds as columns, 14,498 cells as rows, and values representing numbers of species. The trophic profile of every community is thus a point in a 9-dimensional ‘trophic space' defined by the number of species from each trophic guild (a vector of dimension 9). Selection of training samples From the initial set of 14,498 terrestrial grid cells, each measuring 1°×1°, a specific subset of 6,610 continental cells was selected. This subset was defined by their overlap, either partial or complete, with designated protected areas. This subset was crucial for two analytical steps: first, to decipher the community trophic structures; and second, to model the interaction between the prevailing climate and the trophic structure. 
Given the nature of these cells — designated as "continental protected area cells" — we assume they experience reduced human activity compared to the surrounding matrix; an assumption that may not align with reality globally, considering evidence of reduced effectiveness of protected areas in ensuring tangible protection in various parts of the tropics (Geldmann et al., 2019). Nevertheless, a working assumption is made that the trophic structures displayed within these areas likely present a closer reflection of what might be expected from an undisturbed, stable energy network (Mendoza & Araújo, 2022). Identification of the six basic trophic structures through AMD analysis We utilized AMD analysis to explore the previously described 9-dimensional 'community trophic space', defined by the number of species within each trophic guild. This analysis is rooted in computing the Average Membership Degree (AMD) of cluster elements based on their Euclidean distance to the geometric center. The primary aim of AMD analysis is to discern the presence of distinct groups within multidimensional spaces, while concurrently assessing their degree of definition and compactness. The emergence of well-defined community groups within this trophic space allows for the consideration of the identified basic trophic structures as qualitatively distinct entities (Mendoza & Araújo, 2022). We applied AMD analysis to the 6,610 continental protected area cells to confirm that the same six basic trophic structures (TS1 to TS6) identified by Mendoza & Araújo (2022) are present within this curated subset. 
For a more comprehensive understanding of the AMD method and its application to our dataset, readers are directed to the supplementary information of Mendoza & Araújo (2022), accessible via the following link: https://nsojournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2Fecog.06289&file=ecog12872-sup-0001-AppendixS1.pdf Climate modelling of community trophic structures Data preparation We modelled the relationship between climate and trophic structures, utilizing 19 predictors derived from historical bioclimatic data encompassing the years 1961-1990. Denoted as pre-1990 period, this phase marks a time before the significant uptick in temperatures attributable to human-induced greenhouse gas emissions. The trophic profile data, systematically assembled from faunal lists gathered over numerous decades, also hail from an era prior to this pronounced temperature increase. Therefore, these records present a fitting basis for examining the interplay between the trophic structure and the climatic conditions prevalent during the pre-1990 period. The bioclimatic variables represent conditions over specific time periods, and the corresponding trophic structure type (TS1 to TS6) is inferred as the one expected at the end of these periods. Model Implementation Using Xgboost We employed the Extreme Gradient Boosting algorithm (Xgboost) (Chen & Guestrin, 2016), using the xgboost package (Chen et al., 2023), a state-of-the-art machine learning technique known for its superior performance over traditional models such as random forests (e.g., Shao et al., 2024). The target variable in our analysis was the basic type of trophic structure (TS1 to TS6), identified in the previous step (with the AMD analysis) in the 6,610 continental protected area cells. Hyperparameter optimization Before training the model, we optimized the hyperparameters of the Xgboost algorithm to enhance its performance. 
Specifically, we focused on six parameters: learning rate, maximum tree depth, gamma, lambda, alpha, and the number of trees. Due to the enormous number of possible parameter combinations, we employed a Bayesian optimization approach, which provided a more efficient search over the hyperparameter space compared to traditional grid search. As an optimization criterion, we used the xgb.cv cross-validation function within the Xgboost package, based on k-fold cross-validation. Spatial cross-validation by blocks In order to thoroughly assess the predictive accuracy of our model and address the spatial autocorrelation inherent in ecological data, we employed a rigorous Spatial Cross-Validation by Blocks method. This approach entailed partitioning the 6,610 continental protected area cells into 3,848 validation blocks,
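The description breaks off while introducing spatial cross-validation by blocks, so the authors' actual blocking scheme is not recoverable from this text. As a generic sketch of the idea only: grid cells are grouped into coarser spatial blocks and whole blocks are held out together, so that neighbouring (spatially autocorrelated) cells do not end up on both sides of a split. The 5° block size, the function names and the sample coordinates below are all assumptions made for illustration.

```python
from collections import defaultdict


def block_id(lat, lon, block_deg=5):
    """Map a cell centre to the coarse block that contains it."""
    # Floor division keeps negative latitudes and longitudes in the right block.
    return (int(lat // block_deg), int(lon // block_deg))


def spatial_block_folds(cells, block_deg=5):
    """Yield (block, train_idx, test_idx), holding out one block at a time."""
    blocks = defaultdict(list)
    for i, (lat, lon) in enumerate(cells):
        blocks[block_id(lat, lon, block_deg)].append(i)
    for block, test_idx in sorted(blocks.items()):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(cells)) if i not in held_out]
        yield block, train_idx, test_idx


# Hypothetical 1-degree cell centres given as (latitude, longitude).
cells = [(0.5, 0.5), (1.5, 2.5), (0.5, 7.5), (-3.5, 0.5)]
folds = list(spatial_block_folds(cells, block_deg=5))
for block, train_idx, test_idx in folds:
    print(block, train_idx, test_idx)
```

Compared with random k-fold splitting, this usually gives a more conservative accuracy estimate, because the model is always evaluated on areas it has not seen.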

  7. Comprehensive 2033 Report: Python Package Software Market Size, Share, and Industry Insights

    • marketresearchintellect.com
    Updated Aug 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Market Research Intellect (2024). Comprehensive 2033 Report: Python Package Software Market Size, Share, and Industry Insights [Dataset]. https://www.marketresearchintellect.com/zh/product/global-python-package-software-market-size-forecast/
    Dataset updated
    Aug 20, 2024
    Dataset authored and provided by
    Market Research Intellect
    License

    https://www.marketresearchintellect.com/zh/privacy-policy

    Area covered
    Global
    Description

    Learn more about Market Research Intellect's Python Package Software Market Report, valued at USD 700 million in 2024, and set to grow to USD 1.5 billion by 2033 with a CAGR of 9.5% (2026-2033).
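    The growth figures quoted above follow the standard compound annual growth rate (CAGR) formula. A minimal sketch, using the USD 700 million (2024) and USD 1.5 billion (2033) endpoints from the description; the function names are illustrative. The quoted 9.5% refers to 2026-2033, a window whose starting value the description does not give, so the rate computed here over 2024-2033 need not match it.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding `rate` annually for `years` years."""
    return start_value * (1 + rate) ** years


# Endpoints in USD millions, as quoted in the description.
implied = cagr(700, 1500, 2033 - 2024)
print(round(implied, 4))  # about 0.088, i.e. roughly 8.8% per year over 2024-2033
print(round(project(700, implied, 9)))
```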

  8. Comprehensive Python Package Software Market Size, Share, and Industry Outlook...

    • marketresearchintellect.com
    Updated May 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Market Research Intellect (2025). Comprehensive Python Package Software Market Size, Share, and Industry Outlook 2033 [Dataset]. https://www.marketresearchintellect.com/fr/product/global-python-package-software-market-size-forecast/
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Market Research Intellect
    License

    https://www.marketresearchintellect.com/fr/privacy-policy

    Area covered
    Global
    Description

    Market size and share are categorized by Data Analysis (NumPy, Pandas, SciPy, Dask, Vaex), Web Development (Flask, Django, FastAPI, Pyramid, Bottle), Machine Learning (TensorFlow, Scikit-learn, Keras, PyTorch, XGBoost), Visualization (Matplotlib, Seaborn, Plotly, Bokeh, Altair), Automation and Scripting (Requests, Beautiful Soup, Selenium, PyAutoGUI, Fabric), and geographic regions (North America, Europe, Asia-Pacific, South America, Middle East and Africa).

  9. Table_3_Preliminary prediction of semen quality based on modifiable...

    • frontiersin.figshare.com
    docx
    Updated Jun 16, 2023
    + more versions
    Cite
    Mingjuan Zhou; Tianci Yao; Jian Li; Hui Hui; Weimin Fan; Yunfeng Guan; Aijun Zhang; Bufang Xu (2023). Table_3_Preliminary prediction of semen quality based on modifiable lifestyle factors by using the XGBoost algorithm.docx [Dataset]. http://doi.org/10.3389/fmed.2022.811890.s004
    Available download formats: docx
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    Frontiers
    Authors
    Mingjuan Zhou; Tianci Yao; Jian Li; Hui Hui; Weimin Fan; Yunfeng Guan; Aijun Zhang; Bufang Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Semen quality has decreased gradually in recent years, and lifestyle changes are among the primary causes for this issue. Thus far, the specific lifestyle factors affecting semen quality remain to be elucidated.
    Materials and methods: In this study, data on the following factors were collected from 5,109 men examined at our reproductive medicine center: 10 lifestyle factors that potentially affect semen quality (smoking status, alcohol consumption, staying up late, sleeplessness, consumption of pungent food, intensity of sports activity, sedentary lifestyle, working in hot conditions, sauna use in the last 3 months, and exposure to radioactivity); general factors including age, abstinence period, and season of semen examination; and comprehensive semen parameters [semen volume, sperm concentration, progressive and total sperm motility, sperm morphology, and DNA fragmentation index (DFI)]. Then, machine learning with the XGBoost algorithm was applied to establish a primary prediction model by using the collected data. Furthermore, the accuracy of the model was verified via multiple logistic regression following k-fold cross-validation analyses.
    Results: The results indicated that for semen volume, sperm concentration, progressive and total sperm motility, and DFI, the area under the curve (AUC) values ranged from 0.648 to 0.697, while the AUC for sperm morphology was only 0.506. Among the 13 factors, smoking status was the major factor affecting semen volume, sperm concentration, and progressive and total sperm motility. Age was the most important factor affecting DFI. Logistic combined with cross-validation analysis revealed similar results. Furthermore, it showed that heavy smoking (>20 cigarettes/day) had an overall negative effect on semen volume and sperm concentration and progressive and total sperm motility (OR = 4.69, 6.97, 11.16, and 10.35, respectively), while age of >35 years was associated with increased DFI (OR = 5.47).
    Conclusion: The preliminary lifestyle-based model developed for semen quality prediction by using the XGBoost algorithm showed potential for clinical application and further optimization with larger training datasets.
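The odds ratios (OR) reported above come from multiple logistic regression, which adjusts for the other factors. The simpler unadjusted odds ratio from a 2x2 table conveys what the number measures. A minimal sketch, assuming hypothetical counts (none of the numbers below come from the study, and the function names are illustrative):

```python
import math


def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Unadjusted odds ratio: odds of the outcome if exposed vs. if unexposed."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Approximate 95% confidence interval, computed on the log-odds scale."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical table: 30 of 40 exposed and 20 of 60 unexposed show the outcome.
or_value = odds_ratio(30, 10, 20, 40)   # (30 * 40) / (10 * 20) = 6.0
low, high = odds_ratio_ci(30, 10, 20, 40)
print(or_value, round(low, 2), round(high, 2))
```

An odds ratio above 1 means the outcome is more likely in the exposed group; an interval that excludes 1 is the usual sign that the association is statistically significant.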

  10. Table_8_Preliminary prediction of semen quality based on modifiable...

    • frontiersin.figshare.com
    docx
    Updated Jun 13, 2023
    Cite
    Mingjuan Zhou; Tianci Yao; Jian Li; Hui Hui; Weimin Fan; Yunfeng Guan; Aijun Zhang; Bufang Xu (2023). Table_8_Preliminary prediction of semen quality based on modifiable lifestyle factors by using the XGBoost algorithm.docx [Dataset]. http://doi.org/10.3389/fmed.2022.811890.s009
    Available download formats: docx
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Mingjuan Zhou; Tianci Yao; Jian Li; Hui Hui; Weimin Fan; Yunfeng Guan; Aijun Zhang; Bufang Xu
    Description

    Introduction: Semen quality has decreased gradually in recent years, and lifestyle changes are among the primary causes for this issue. Thus far, the specific lifestyle factors affecting semen quality remain to be elucidated.
    Materials and methods: In this study, data on the following factors were collected from 5,109 men examined at our reproductive medicine center: 10 lifestyle factors that potentially affect semen quality (smoking status, alcohol consumption, staying up late, sleeplessness, consumption of pungent food, intensity of sports activity, sedentary lifestyle, working in hot conditions, sauna use in the last 3 months, and exposure to radioactivity); general factors including age, abstinence period, and season of semen examination; and comprehensive semen parameters [semen volume, sperm concentration, progressive and total sperm motility, sperm morphology, and DNA fragmentation index (DFI)]. Then, machine learning with the XGBoost algorithm was applied to establish a primary prediction model by using the collected data. Furthermore, the accuracy of the model was verified via multiple logistic regression following k-fold cross-validation analyses.
    Results: The results indicated that for semen volume, sperm concentration, progressive and total sperm motility, and DFI, the area under the curve (AUC) values ranged from 0.648 to 0.697, while the AUC for sperm morphology was only 0.506. Among the 13 factors, smoking status was the major factor affecting semen volume, sperm concentration, and progressive and total sperm motility. Age was the most important factor affecting DFI. Logistic combined with cross-validation analysis revealed similar results. Furthermore, it showed that heavy smoking (>20 cigarettes/day) had an overall negative effect on semen volume and sperm concentration and progressive and total sperm motility (OR = 4.69, 6.97, 11.16, and 10.35, respectively), while age of >35 years was associated with increased DFI (OR = 5.47).
    Conclusion: The preliminary lifestyle-based model developed for semen quality prediction by using the XGBoost algorithm showed potential for clinical application and further optimization with larger training datasets.

  11. Fit statistics for scored XGBoost models with 50,000 rows per dataset.

    • figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Fit statistics for scored XGBoost models with 50,000 rows per dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t002
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fit statistics for scored XGBoost models with 50,000 rows per dataset.

  12. Raw data.

    • figshare.com
    bin
    Updated Aug 11, 2023
    + more versions
    Cite
    Yuan Liu; Wenyi Du; Yi Guo; Zhiqiang Tian; Wei Shen (2023). Raw data. [Dataset]. http://doi.org/10.1371/journal.pone.0289621.s002
    Available download formats: bin
    Dataset updated
    Aug 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yuan Liu; Wenyi Du; Yi Guo; Zhiqiang Tian; Wei Shen
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Colon cancer recurrence is a common adverse outcome for patients after complete mesocolic excision (CME) and greatly affects the near-term and long-term prognosis of patients. This study aimed to develop a machine learning model that can identify high-risk factors before, during, and after surgery, and predict the occurrence of postoperative colon cancer recurrence. Methods: The study included 1187 patients with colon cancer, including 110 patients who had recurrent colon cancer. The researchers collected 44 characteristic variables, including patient demographic characteristics, basic medical history, preoperative examination information, type of surgery, and intraoperative information. Four machine learning algorithms, namely extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), and k-nearest neighbor algorithm (KNN), were used to construct the model. The researchers evaluated the model using the k-fold cross-validation method, ROC curve, calibration curve, decision curve analysis (DCA), and external validation. Results: Among the four prediction models, the XGBoost algorithm performed the best. The ROC curve results showed that the AUC value of XGBoost was 0.962 in the training set and 0.952 in the validation set, indicating high prediction accuracy. The XGBoost model was stable during internal validation using the k-fold cross-validation method. The calibration curve demonstrated high predictive ability of the XGBoost model. The DCA curve showed that patients who received interventional treatment had a higher benefit rate under the XGBoost model. The external validation set’s AUC value was 0.91, indicating good extrapolation of the XGBoost prediction model. Conclusion: The XGBoost machine learning algorithm-based prediction model for colon cancer recurrence has high prediction accuracy and clinical utility.

  13. Table 3_A machine learning model to predict neurological deterioration after...

    • frontiersin.figshare.com
    docx
    Updated Jan 3, 2025
    + more versions
    Cite
    Daisu Abe; Motoki Inaji; Takeshi Hase; Eiichi Suehiro; Naoto Shiomi; Hiroshi Yatsushige; Shin Hirota; Shu Hasegawa; Hiroshi Karibe; Akihiro Miyata; Kenya Kawakita; Kohei Haji; Hideo Aihara; Shoji Yokobori; Takeshi Maeda; Takahiro Onuki; Kotaro Oshio; Nobukazu Komoribayashi; Michiyasu Suzuki; Taketoshi Maehara (2025). Table 3_A machine learning model to predict neurological deterioration after mild traumatic brain injury in older adults.docx [Dataset]. http://doi.org/10.3389/fneur.2024.1502153.s003
    Available download formats: docx
    Dataset updated
    Jan 3, 2025
    Dataset provided by
    Frontiers
    Authors
    Daisu Abe; Motoki Inaji; Takeshi Hase; Eiichi Suehiro; Naoto Shiomi; Hiroshi Yatsushige; Shin Hirota; Shu Hasegawa; Hiroshi Karibe; Akihiro Miyata; Kenya Kawakita; Kohei Haji; Hideo Aihara; Shoji Yokobori; Takeshi Maeda; Takahiro Onuki; Kotaro Oshio; Nobukazu Komoribayashi; Michiyasu Suzuki; Taketoshi Maehara
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Neurological deterioration after mild traumatic brain injury (TBI) has been recognized as a poor prognostic factor. Early detection of neurological deterioration would allow appropriate monitoring and timely therapeutic interventions to improve patient outcomes. In this study, we developed a machine learning model to predict the occurrence of neurological deterioration after mild TBI using information obtained on admission. Methods: This was a retrospective cohort study of data from the Think FAST registry, a multicenter prospective observational study of elderly TBI patients in Japan. Patients with an admission Glasgow Coma Scale (GCS) score of 12 or below or who underwent surgical treatment immediately upon admission were excluded. Neurological deterioration was defined as a decrease of 2 or more points from a GCS score of 13 or more within 24 h of hospital admission. The model predictive accuracy was judged with the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), and the Youden index was used to determine the cutoff value. Results: A total of 421 of 721 patients registered in the Think FAST registry between December 2019 and May 2021 were included in our study, among whom 25 demonstrated neurological deterioration. Among several machine learning algorithms, eXtreme Gradient Boosting (XGBoost) demonstrated the highest predictive accuracy in cross-validation, with an AUROC of 0.81 (±0.07) and an AUPRC of 0.33 (±0.08). Through SHapley Additive exPlanations (SHAP) analysis, five important features (D-dimer, fibrinogen, acute subdural hematoma thickness, cerebral contusion size, and systolic blood pressure) were identified and used to construct a better performing model (cross-validation AUROC of 0.84 and AUPRC of 0.34; testing data AUROC of 0.77 and AUPRC of 0.19). At the cutoff value from the Youden index, the model showed a sensitivity, specificity, and positive predictive value of 60, 96, and 38%, respectively. When neurosurgeons attempted to predict neurological deterioration using the same testing data, their values were 20, 94, and 19%, respectively. Conclusion: In this study, our predictive model showed an acceptable performance in detecting neurological deterioration after mild TBI. Further validation through prospective studies is necessary to confirm these results.
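
The Youden-index cutoff mentioned in this description can be illustrated with a minimal sketch. It is not the study's code: the function name `youden_cutoff` and the toy scores and labels are hypothetical, and the only fact relied on is the standard definition J = sensitivity + specificity - 1, maximized over candidate cutoffs.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J, sensitivity, specificity) for the cutoff maximizing J.

    A sample is predicted positive when its score is >= cutoff.
    Ties in J are resolved in favor of the lowest cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        true_neg = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        sensitivity = true_pos / positives
        specificity = true_neg / negatives
        j = sensitivity + specificity - 1
        if best is None or j > best[1]:
            best = (cutoff, j, sensitivity, specificity)
    return best


# Toy, made-up predicted probabilities and outcomes (not data from the study).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j, sensitivity, specificity = youden_cutoff(scores, labels)
print(cutoff, j, sensitivity, specificity)
```

On a rare outcome such as the one described here (25 of 421 patients), the positive predictive value at the chosen cutoff can stay low even when specificity is high, which is consistent with the figures reported in the description.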

  14. The five cross-validation stages involved in the present study.

    • plos.figshare.com
    xls
    Updated Jun 5, 2023
    Cite
    Lifeng Wu; Junliang Fan (2023). The five cross-validation stages involved in the present study. [Dataset]. http://doi.org/10.1371/journal.pone.0217520.t002
    Available download formats: xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lifeng Wu; Junliang Fan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The five cross-validation stages involved in the present study.

  15. Descriptive statistics of data set.

    • plos.figshare.com
    xls
    Updated Jun 10, 2024
    + more versions
    Cite
    Davood Fereidooni; Zohre Karimi; Fatemeh Ghasemi (2024). Descriptive statistics of data set. [Dataset]. http://doi.org/10.1371/journal.pone.0302944.t002
    Available download formats: xls
    Dataset updated
    Jun 10, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Davood Fereidooni; Zohre Karimi; Fatemeh Ghasemi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The uniaxial compressive strength (UCS) and elasticity modulus (E) of intact rock are two fundamental requirements in engineering applications. These parameters can be measured either directly from the uniaxial compressive strength test or indirectly by using soft computing predictive models. In the present research, the UCS and E of intact carbonate rocks have been predicted by introducing two stacking ensemble learning models from non-destructive simple laboratory test results. For this purpose, dry unit weight, porosity, P‐wave velocity, Brinell surface hardness, UCS, and static E were measured for 70 carbonate rock samples. Then, two stacking ensemble learning models were developed for estimating the UCS and E of the rocks. The applied stacking ensemble learning method integrates the advantages of two base models in the first level, where the base models are multi-layer perceptron (MLP) and random forest (RF) for predicting UCS, and support vector regressor (SVR) and extreme gradient boosting (XGBoost) for predicting E. Grid search with k-fold cross-validation is applied to tune the parameters of both the base models and the meta-learner. The results demonstrate the generalization ability of the stacking ensemble method compared with the base models in terms of common performance measures. The values of the coefficient of determination (R²) obtained from the stacking ensemble are 0.909 and 0.831 for predicting UCS and E, respectively. Similarly, the stacking ensemble yielded Root Mean Squared Error (RMSE) values of 1.967 and 0.621 for the prediction of UCS and E, respectively. Accordingly, the proposed models are superior to SVR and MLP as single models and to RF and XGBoost as two representative ensemble models. Furthermore, sensitivity analysis is carried out to investigate the impact of input parameters.

  16. Data_Sheet_1_Machine learning strategy for identifying altered gut...

    • frontiersin.figshare.com
    docx
    Updated Sep 27, 2023
    Cite
    Che-Cheng Chang; Tzu-Chi Liu; Chi-Jie Lu; Hou-Chang Chiu; Wei-Ning Lin (2023). Data_Sheet_1_Machine learning strategy for identifying altered gut microbiomes for diagnostic screening in myasthenia gravis.docx [Dataset]. http://doi.org/10.3389/fmicb.2023.1227300.s001
    Available download formats: docx
    Dataset updated
    Sep 27, 2023
    Dataset provided by
    Frontiers
    Authors
    Che-Cheng Chang; Tzu-Chi Liu; Chi-Jie Lu; Hou-Chang Chiu; Wei-Ning Lin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Myasthenia gravis (MG) is a neuromuscular junction disease with a complex pathophysiology and clinical variation for which no clear biomarker has been discovered. We hypothesized that because changes in gut microbiome composition often occur in autoimmune diseases, the gut microbiome structures of patients with MG would differ from those without, and supervised machine learning (ML) analysis strategy could be trained using data from gut microbiota for diagnostic screening of MG. Genomic DNA from the stool samples of MG and those without were collected and established a sequencing library by constructing amplicon sequence variants (ASVs) and completing taxonomic classification of each representative DNA sequence. Four ML methods, namely least absolute shrinkage and selection operator, extreme gradient boosting (XGBoost), random forest, and classification and regression trees with nested leave-one-out cross-validation were trained using ASV taxon–based data and full ASV–based data to identify key ASVs in each data set. The results revealed XGBoost to have the best predicted performance. Overlapping key features extracted when XGBoost was trained using the full ASV–based and ASV taxon–based data were identified, and 31 high-importance ASVs (HIASVs) were obtained, assigned importance scores, and ranked. The most significant difference observed was in the abundance of bacteria in the Lachnospiraceae and Ruminococcaceae families. The 31 HIASVs were used to train the XGBoost algorithm to differentiate individuals with and without MG. The model had high diagnostic classification power and could accurately predict and identify patients with MG. In addition, the abundance of Lachnospiraceae was associated with limb weakness severity. In this study, we discovered that the composition of gut microbiomes differed between MG and non-MG subjects. 
In addition, the proposed XGBoost model trained using 31 HIASVs had the most favorable performance with respect to analyzing gut microbiomes. These HIASVs selected by the ML model may serve as biomarkers for clinical use and mechanistic study in the future. Our proposed ML model can identify several taxonomic markers and discriminate patients with MG from those without with high accuracy. The ML strategy can be applied as a benchmark to conduct noninvasive screening of MG.

  17. Accuracy and p-value obtained from 5-fold cross validation for three machine...

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Yu Zhang; Jennifer L. Pechal; Carl J. Schmidt; Heather R. Jordan; Wesley W. Wang; M. Eric Benbow; Sing-Hoi Sze; Aaron M. Tarone (2023). Accuracy and p-value obtained from 5-fold cross validation for three machine learning methods (xgboost, random forest and neural network) for the prediction of postmortem interval, event location and manner of death using the microbiota from all anatomic locations (ears, eyes, nose, mouth, and rectum). [Dataset]. http://doi.org/10.1371/journal.pone.0213829.t002
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yu Zhang; Jennifer L. Pechal; Carl J. Schmidt; Heather R. Jordan; Wesley W. Wang; M. Eric Benbow; Sing-Hoi Sze; Aaron M. Tarone
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accuracy and p-value obtained from 5-fold cross validation for three machine learning methods (xgboost, random forest and neural network) for the prediction of postmortem interval, event location and manner of death using the microbiota from all anatomic locations (ears, eyes, nose, mouth, and rectum).

  18. Data Sheet 1_Predicting the risk of gastroparesis in critically ill patients...

    • figshare.com
    xlsx
    Updated Jan 6, 2025
    Cite
    Yuan Liu; Songyun Zhao; Wenyi Du; Wei Shen; Ning Zhou (2025). Data Sheet 1_Predicting the risk of gastroparesis in critically ill patients after CME using an interpretable machine learning algorithm – a 10-year multicenter retrospective study.xlsx [Dataset]. http://doi.org/10.3389/fmed.2024.1467565.s001
    Available download formats: xlsx
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    Frontiers
    Authors
    Yuan Liu; Songyun Zhao; Wenyi Du; Wei Shen; Ning Zhou
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Gastroparesis following complete mesocolic excision (CME) can precipitate a cascade of severe complications, which may significantly hinder postoperative recovery and diminish the patient’s quality of life. In the present study, four advanced machine learning algorithms, Extreme Gradient Boosting (XGBoost), Random Forest (RF), Support Vector Machine (SVM), and k-nearest neighbor (KNN), were employed to develop predictive models. The clinical data of critically ill patients transferred to the intensive care unit (ICU) post-CME were meticulously analyzed to identify key risk factors associated with the development of gastroparesis. Methods: We gathered 34 feature variables from a cohort of 1,097 colon cancer patients, including 87 individuals who developed gastroparesis post-surgery, across multiple hospitals, and applied a range of machine learning algorithms to construct the predictive model. To assess the model’s generalization performance, we employed 10-fold cross-validation, while the receiver operating characteristic (ROC) curve was utilized to evaluate its discriminative capacity. Additionally, calibration curves, decision curve analysis (DCA), and external validation were integrated to provide a comprehensive evaluation of the model’s clinical applicability and utility. Results: Among the four predictive models, the XGBoost algorithm demonstrated superior performance. As indicated by the ROC curve, XGBoost achieved an area under the curve (AUC) of 0.939 in the training set and 0.876 in the validation set, reflecting exceptional predictive accuracy. Notably, in the k-fold cross-validation, the XGBoost model exhibited robust consistency across all folds, underscoring its stability. The calibration curve further revealed a favorable concordance between the predicted probabilities and the actual outcomes of the XGBoost model. Additionally, the DCA highlighted that patients receiving intervention under the XGBoost model experienced significantly greater clinical benefit. Conclusion: The onset of postoperative gastroparesis in colon cancer patients remains an elusive challenge to entirely prevent. However, the prediction model developed in this study offers valuable assistance to clinicians in identifying key high-risk factors for gastroparesis, thereby enhancing the quality of life and survival outcomes for these patients.
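
The 10-fold cross-validation named in this description can be sketched with plain index bookkeeping. This is an illustrative assumption, not the authors' pipeline: the helper `k_fold_indices`, the seed, and the sample count of 25 are hypothetical, and the description does not say whether the folds were stratified by outcome.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold serves
    exactly once as the held-out test set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, test


# Hypothetical example: 25 samples split into 10 folds.
splits = list(k_fold_indices(n_samples=25, k=10))
for train, test in splits:
    # A model would be fitted on `train` and scored on `test` here.
    print(len(train), len(test))
```

With an outcome as infrequent as the one reported here (87 of 1,097 patients), a stratified split that keeps the event rate similar across folds is a common choice, although the description does not state which variant was used.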

  19. MCC as the object of difference analysis: 10-fold cross-validation...

    • plos.figshare.com
    xls
    Updated Jun 15, 2023
    Cite
    Hu Ai (2023). MCC as the object of difference analysis: 10-fold cross-validation classification metrics of the top three genes. [Dataset]. http://doi.org/10.1371/journal.pone.0263171.t002
    Available download formats: xls
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Hu Ai
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MCC as the object of difference analysis: 10-fold cross-validation classification metrics of the top three genes.

  20. Table_2_Local and Distributed Machine Learning for Inter-hospital Data...

    • frontiersin.figshare.com
    docx
    Updated May 30, 2023
    Cite
    Ricardo R. Lopes; Marco Mamprin; Jo M. Zelis; Pim A. L. Tonino; Martijn S. van Mourik; Marije M. Vis; Svitlana Zinger; Bas A. J. M. de Mol; Peter H. N. de With; Henk A. Marquering (2023). Table_2_Local and Distributed Machine Learning for Inter-hospital Data Utilization: An Application for TAVI Outcome Prediction.docx [Dataset]. http://doi.org/10.3389/fcvm.2021.787246.s002
    Available download formats: docx
    Dataset updated
    May 30, 2023
    Dataset provided by
    Frontiers
    Authors
    Ricardo R. Lopes; Marco Mamprin; Jo M. Zelis; Pim A. L. Tonino; Martijn S. van Mourik; Marije M. Vis; Svitlana Zinger; Bas A. J. M. de Mol; Peter H. N. de With; Henk A. Marquering
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Machine learning models have been developed for numerous medical prognostic purposes. These models are commonly developed using data from single centers or regional registries. Including data from multiple centers improves robustness and accuracy of prognostic models. However, data sharing between multiple centers is complex, mainly because of regulations and patient privacy issues. Objective: We aim to overcome data sharing impediments by using distributed ML and local learning followed by model integration. We applied these techniques to develop 1-year TAVI mortality estimation models with data from two centers without sharing any data. Methods: A distributed ML technique and local learning followed by model integration were used to develop models to predict 1-year mortality after TAVI. We included two populations with 1,160 (Center A) and 631 (Center B) patients. Five traditional ML algorithms were implemented. The results were compared to models created individually on each center. Results: The combined learning techniques outperformed the mono-center models. For center A, the combined local XGBoost achieved an AUC of 0.67 (compared to a mono-center AUC of 0.65) and, for center B, a distributed neural network achieved an AUC of 0.68 (compared to a mono-center AUC of 0.64). Conclusion: This study shows that distributed ML and combined local model techniques can overcome data sharing limitations and result in more accurate models for TAVI mortality estimation. We have shown improved prognostic accuracy for both centers, and these techniques can also be used as an alternative to overcome the problem of limited amounts of data when creating prognostic models.


