31 datasets found
  1. Hyperparameters for the XGBoost model.

    • plos.figshare.com
    xls
    Updated Nov 27, 2024
    Cite
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen (2024). Hyperparameters for the XGBoost model. [Dataset]. http://doi.org/10.1371/journal.pone.0312531.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In building structures, lateral load-bearing capacity depends mainly on reinforced concrete (RC) walls, so determining their flexural and shear strength is a mandatory design step. These strengths are typically obtained from theoretical formulas and verified experimentally, but the formulas often carry large errors while testing is costly and time-consuming. This study therefore applies machine learning, specifically a hybrid XGBoost model combined with optimization algorithms, to predict the shear strength of RC walls from available experimental results. The study used the largest database of RC walls to date: 1,057 samples with various cross-sectional shapes. Bayesian optimization (BO) methods, namely BO-Gaussian Process and BO-Random Forest, together with Random Search, were used to refine the XGBoost model architecture. Gaussian Process emerged as the most efficient option, yielding the lowest mean square error and prediction R² values of 0.998 on the training set, 0.972 on the validation set, and 0.984 on the test set. BO-Random Forest and Random Search matched Gaussian Process on the training and test sets but performed significantly worse on the validation set, with validation R² of 0.970 and 0.969 respectively, over the entire dataset covering all cross-sectional shapes. The SHAP (SHapley Additive exPlanations) technique was used to clarify the model's predictive behavior and the importance of the input variables, and performance was further validated against benchmark models and current design standards. Notably, the coefficient of variation (COV) of the XGBoost model is 13.27%, whereas traditional models often exceed 50%.
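
    The tuning code itself is not part of this record; as a rough sketch of the BO-Gaussian Process approach the abstract describes, XGBoost hyperparameters can be tuned with a GP-backed search such as scikit-optimize's BayesSearchCV (the parameter ranges, data shapes, and package choice below are illustrative assumptions, not the authors' published setup):

    # Hedged sketch: Gaussian-process Bayesian optimization of XGBoost
    # hyperparameters, minimizing cross-validated MSE as in the study.
    # Data, ranges, and iteration counts are placeholders.
    import numpy as np
    from skopt import BayesSearchCV          # GP surrogate by default
    from skopt.space import Integer, Real
    from xgboost import XGBRegressor

    X = np.random.rand(200, 8)               # stand-in for wall features
    y = np.random.rand(200)                  # stand-in for shear strengths

    search = BayesSearchCV(
        estimator=XGBRegressor(objective="reg:squarederror"),
        search_spaces={
            "n_estimators": Integer(100, 1000),
            "max_depth": Integer(3, 10),
            "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
            "subsample": Real(0.5, 1.0),
        },
        n_iter=40,                            # BO evaluations
        cv=5,
        scoring="neg_mean_squared_error",     # BO minimizes MSE, as above
        random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_, -search.best_score_)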

  2. Model hyperparameters used for the XGBoost model.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 17, 2024
    Cite
    Saikia, Bhaskar Jyoti; Scaria, Vinod; Kumar, Mukesh; K. , Binukumar B.; Vatsyayan, Aastha (2024). Model hyperparameters used for the XGBoost model. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001281231
    Explore at:
    Dataset updated
    May 17, 2024
    Authors
    Saikia, Bhaskar Jyoti; Scaria, Vinod; Kumar, Mukesh; K. , Binukumar B.; Vatsyayan, Aastha
    Description

    Background: Advances in Next Generation Sequencing have made rapid variant discovery and detection widely accessible. To facilitate a better understanding of the nature of these variants, the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG-AMP) have issued a set of guidelines for variant classification. However, given the vast number of variants associated with any disorder, it is impossible to manually apply these guidelines to all known variants. Machine learning offers a rapid way to classify large numbers of variants, including variants of uncertain significance, as either pathogenic or benign. Here we classify ATP7B genetic variants by employing ML and AI algorithms trained on our well-annotated WilsonGen dataset.

    Methods: We trained and validated two algorithms, TabNet and XGBoost, on a high-confidence dataset of manually annotated, ACMG & AMP classified variants of the ATP7B gene associated with Wilson's Disease.

    Results: Using an independent validation dataset of ACMG & AMP classified variants, as well as a patient set of functionally validated variants, we show how both algorithms perform and can be used to classify large numbers of variants in clinical and research settings.

    Conclusion: We have created a ready-to-deploy tool that can classify variants linked with Wilson's disease as pathogenic or benign, and that can be utilized by both clinicians and researchers to better understand the disease through the nature of its associated genetic variants.

  3. CMAB-The World's First National-Scale Multi-Attribute Building Dataset

    • figshare.com
    bin
    Updated Apr 20, 2025
    Cite
    Yecheng Zhang; Huimin Zhao; Ying Long (2025). CMAB-The World's First National-Scale Multi-Attribute Building Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.27992417.v7
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 20, 2025
    Dataset provided by
    figshare
    Authors
    Yecheng Zhang; Huimin Zhao; Ying Long
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Rapidly acquiring three-dimensional (3D) building data, including geometric attributes such as rooftop, height, and orientation, as well as indicative attributes such as function, quality, and age, is essential for accurate urban analysis, simulation, and policy updates. Current building datasets suffer from incomplete coverage of building multi-attributes. This paper presents the first national-scale Multi-Attribute Building dataset (CMAB), built with artificial intelligence and covering 3,667 spatial cities, 31 million buildings, and 23.6 billion m² of rooftops (extracted with OCRNet at an F1-score of 89.93%), totaling 363 billion m³ of building stock. We trained bootstrap-aggregated XGBoost models with city administrative classifications, incorporating morphology, location, and function features. Using multi-source data, including billions of remote sensing images and 60 million street view images (SVIs), we generated rooftop, height, structure, function, style, age, and quality attributes for each building with machine learning and large multimodal models. Accuracy was validated through model benchmarks, comparison with existing similar products, and manual SVI validation, and is mostly above 80%. Our dataset and results are crucial for global SDGs and urban planning.

    Data records: A building dataset with a total rooftop area of 23.6 billion square meters across 3,667 natural cities in China, including each building's rooftop, height, structure, function, age, style, and quality attributes, together with the code files used to compute these data. The deep learning models used are OCRNet, XGBoost, fine-tuned CLIP, and YOLOv8.

    Supplementary note: The architectural structure, style, and quality attributes are affected by the temporal and spatial distribution of street views in China. For building colors, we found that existing CLIP-series models cannot accurately judge color composition and proportion, so colors are instead calculated via semantic segmentation and image processing. Please contact zhangyec23@mails.tsinghua.edu.cn or ylong@tsinghua.edu.cn with any technical problems.

    Reference format: Zhang, Y., Zhao, H. & Long, Y. CMAB: A Multi-Attribute Building Dataset of China. Sci Data 12, 430 (2025). https://doi.org/10.1038/s41597-025-04730-5.
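
    The released code is not reproduced in this listing; below is a minimal sketch of the "bootstrap aggregated XGBoost" idea, using scikit-learn's BaggingClassifier around an XGBoost base learner (the feature layout and class count are assumptions):

    # Hedged sketch: bagging XGBoost classifiers over bootstrap resamples.
    # Requires scikit-learn >= 1.2 for the `estimator` keyword.
    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from xgboost import XGBClassifier

    X = np.random.rand(1000, 12)             # stand-in morphology/location features
    y = np.random.randint(0, 4, 1000)        # stand-in building-function classes

    bagged = BaggingClassifier(
        estimator=XGBClassifier(n_estimators=200, max_depth=6),
        n_estimators=10,                     # ten bootstrap replicates
        bootstrap=True,
        n_jobs=-1,
        random_state=0,
    )
    bagged.fit(X, y)
    print(bagged.predict(X[:5]))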

  4. Data from: Extreme Gradient Boosting as a Method for Quantitative Structure–Activity Relationships

    • acs.figshare.com
    • figshare.com
    zip
    Updated Jun 6, 2023
    Cite
    Robert P. Sheridan; Wei Min Wang; Andy Liaw; Junshui Ma; Eric M. Gifford (2023). Extreme Gradient Boosting as a Method for Quantitative Structure–Activity Relationships [Dataset]. http://doi.org/10.1021/acs.jcim.6b00591.s033
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan; Wei Min Wang; Andy Liaw; Junshui Ma; Eric M. Gifford
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In the pharmaceutical industry it is common to generate many QSAR models from training sets containing large numbers of molecules and descriptors. The best QSAR methods are those that generate the most accurate predictions without being overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions that are, on average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed: whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
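
    The paper's actual standard parameter values are not included in this record; the sketch below only illustrates the pattern it advocates, i.e. reusing one fixed XGBoost configuration across many QSAR data sets (all values are placeholders):

    # Hedged sketch: one shared ("standard") parameter set applied to every
    # QSAR training set; the values are illustrative, not the paper's.
    from xgboost import XGBRegressor

    STANDARD_PARAMS = dict(
        n_estimators=1000,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        n_jobs=1,                            # single-CPU, as in the speed claim
    )

    def fit_qsar_model(X_train, y_train):
        """Train one QSAR model with the shared parameter set."""
        model = XGBRegressor(**STANDARD_PARAMS)
        model.fit(X_train, y_train)
        return model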

  5. Electricity Demand Historical Data

    • kaggle.com
    zip
    Updated Jul 26, 2025
    Cite
    Data Science Lovers (2025). Electricity Demand Historical Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/electricity-demand-data-dsl
    Explore at:
    Available download formats: zip (968020 bytes)
    Dataset updated
    Jul 26, 2025
    Authors
    Data Science Lovers
    License

    Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    📹 Project Video available on YouTube - https://youtu.be/iop8TUxmgO0

    🖇️Connect with me on LinkedIn - https://www.linkedin.com/in/rohit-grewal

    Electricity Demand Forecasting Dataset (XGBoost Model Ready)

    This dataset contains five years of historical data for predicting electricity demand with machine learning, especially with models like XGBoost. It includes features such as temperature, humidity, wind speed, and past electricity usage across different time intervals.

    The dataset is designed to help you learn and build models that can forecast how much electricity people might use in the future. This is useful for energy companies, smart grids, and power management systems.

    The features/columns available in the dataset are:

    • Timestamp: The date of the observation
    • Demand: Actual electricity demand at that time (target variable)
    • Temperature: Temperature in degrees Celsius
    • Humidity: Humidity percentage
    • Hour: Hour of the day (0–23)
    • DayOfWeek: Day of the week (0 = Monday, 6 = Sunday)
    • Month: Month number (1 = January, 12 = December)
    • Year: Year of the observation

    Potential Use Cases:

    • Build regression models to forecast electricity demand
    • Use lag and rolling features in time series models
    • Compare performance of ML algorithms like XGBoost, Random Forest, and LSTM
    • Learn how environmental and time-based factors affect electricity usage
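
    As a starting point for the first two use cases, a hedged sketch of lag/rolling feature construction and an XGBoost regressor (the file name, hourly row spacing, and hold-out size are assumptions):

    # Hedged sketch: lag/rolling features + XGBoost on the columns above.
    import pandas as pd
    from xgboost import XGBRegressor

    df = pd.read_csv("electricity_demand.csv", parse_dates=["Timestamp"])
    df = df.sort_values("Timestamp")

    # Assuming hourly rows: 24 steps = one day
    df["Demand_lag_24"] = df["Demand"].shift(24)
    df["Demand_roll_7d"] = df["Demand"].rolling(24 * 7).mean()
    df = df.dropna()

    features = ["Temperature", "Humidity", "Hour", "DayOfWeek", "Month",
                "Demand_lag_24", "Demand_roll_7d"]
    train, test = df.iloc[:-24 * 30], df.iloc[-24 * 30:]   # hold out ~30 days

    model = XGBRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(train[features], train["Demand"])
    pred = model.predict(test[features])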

  6. Table_1_A cost-effective, machine learning-driven approach for screening arterial functional aging in a large-scale Chinese population.DOC

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    doc
    Updated Mar 20, 2024
    Cite
    Rujia Miao; Qian Dong; Xuelian Liu; Yingying Chen; Jiangang Wang; Jianwen Chen (2024). Table_1_A cost-effective, machine learning-driven approach for screening arterial functional aging in a large-scale Chinese population.DOC [Dataset]. http://doi.org/10.3389/fpubh.2024.1365479.s002
    Explore at:
    Available download formats: doc
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Rujia Miao; Qian Dong; Xuelian Liu; Yingying Chen; Jiangang Wang; Jianwen Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: An easily accessible, cost-free machine learning model based on prior probabilities of vascular aging enables an application to pinpoint high-risk populations before physical checks and to optimize healthcare investment.

    Methods: A dataset containing questionnaire responses and physical measurement parameters from 77,134 adults was extracted from the electronic records of the Health Management Center at the Third Xiangya Hospital. The least absolute shrinkage and selection operator (LASSO) and recursive feature elimination with the light gradient boosting machine (LightGBM) were employed to select features from a pool of potential covariates. Participants were randomly divided into training (70%) and test (30%) cohorts. Four machine learning algorithms were applied to build screening models for elevated arterial stiffness (EAS), and model performance was evaluated by the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy.

    Results: Fourteen easily accessible features were selected to construct the model: "systolic blood pressure" (SBP), "age," "waist circumference," "history of hypertension," "sex," "exercise," "awareness of normal blood pressure," "eat fruit," "work intensity," "drink milk," "eat bean products," "smoking," "alcohol consumption," and "irritableness." The extreme gradient boosting (XGBoost) model outperformed the other three models, achieving AUC values of 0.8722 and 0.8710 in the training and test sets, respectively. The five most important features are SBP, age, waist circumference, history of hypertension, and sex.

    Conclusion: The XGBoost model effectively assesses the prior probability of current EAS in the general population. Integrating the model into primary care facilities has the potential to lower medical expenses and enhance the management of arterial aging.
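
    The study's code is not part of this record; below is a minimal sketch of the evaluation it describes (AUC, sensitivity, specificity, accuracy for a binary EAS classifier), on placeholder data with the same 70/30 split:

    # Hedged sketch: screening-model metrics named above, placeholder data.
    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X = np.random.rand(500, 14)              # 14 features, as in the abstract
    y = np.random.randint(0, 2, 500)         # 1 = elevated arterial stiffness
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    model = XGBClassifier().fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    pred = (proba >= 0.5).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    print("AUC:", roc_auc_score(y_te, proba))
    print("Sensitivity:", tp / (tp + fn))
    print("Specificity:", tn / (tn + fp))
    print("Accuracy:", accuracy_score(y_te, pred))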

  7. A machine learning based prediction model for life expectancy

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Nov 14, 2022
    Cite
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo (2022). A machine learning based prediction model for life expectancy [Dataset]. http://doi.org/10.5061/dryad.z612jm6fv
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 14, 2022
    Dataset provided by
    Dryad
    Authors
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo
    Time period covered
    Oct 12, 2022
    Description

    Microsoft Excel

  8. Fraud Detection Transactions Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Cite
    Samay Ashar (2025). Fraud Detection Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
    Explore at:
    Available download formats: zip (2104444 bytes)
    Dataset updated
    Feb 21, 2025
    Authors
    Samay Ashar
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM.

    📌 Key Features

    1. 21 features capturing various aspects of a financial transaction
    2. Realistic structure with numerical, categorical, and temporal data
    3. Binary fraud labels (0 = Not Fraud, 1 = Fraud)
    4. Designed for high accuracy with XGBoost and other ML models
    5. Useful for anomaly detection, risk analysis, and security research

    📌 Columns in the Dataset

    • Transaction_ID: Unique identifier for each transaction
    • User_ID: Unique identifier for the user
    • Transaction_Amount: Amount of money involved in the transaction
    • Transaction_Type: Type of transaction (Online, In-Store, ATM, etc.)
    • Timestamp: Date and time of the transaction
    • Account_Balance: User's account balance before the transaction
    • Device_Type: Type of device used (Mobile, Desktop, etc.)
    • Location: Geographical location of the transaction
    • Merchant_Category: Type of merchant (Retail, Food, Travel, etc.)
    • IP_Address_Flag: Whether the IP address was flagged as suspicious (0 or 1)
    • Previous_Fraudulent_Activity: Number of past fraudulent activities by the user
    • Daily_Transaction_Count: Number of transactions made by the user that day
    • Avg_Transaction_Amount_7d: User's average transaction amount in the past 7 days
    • Failed_Transaction_Count_7d: Count of failed transactions in the past 7 days
    • Card_Type: Type of payment card used (Credit, Debit, Prepaid, etc.)
    • Card_Age: Age of the card in months
    • Transaction_Distance: Distance between the user's usual location and the transaction location
    • Authentication_Method: How the user authenticated (PIN, Biometric, etc.)
    • Risk_Score: Fraud risk score computed for the transaction
    • Is_Weekend: Whether the transaction occurred on a weekend (0 or 1)
    • Fraud_Label: Target variable (0 = Not Fraud, 1 = Fraud)

    📌 Potential Use Cases

    1. Fraud detection model training
    2. Anomaly detection in financial transactions
    3. Risk scoring systems for banks and fintech companies
    4. Feature engineering and model explainability research
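
    A hedged baseline for the first use case, with the CSV file name assumed and scale_pos_weight used to counteract the fraud/non-fraud imbalance:

    # Hedged sketch: baseline fraud classifier on the columns above.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    df = pd.read_csv("fraud_transactions.csv")           # file name assumed
    y = df["Fraud_Label"]
    X = pd.get_dummies(df.drop(columns=["Fraud_Label", "Transaction_ID",
                                        "User_ID", "Timestamp"]))

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    model = XGBClassifier(
        n_estimators=400,
        scale_pos_weight=(y_tr == 0).sum() / max((y_tr == 1).sum(), 1),
    )
    model.fit(X_tr, y_tr)
    print("Test accuracy:", model.score(X_te, y_te))
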
  9. Surrogate models used in the study.

    • plos.figshare.com
    xls
    Updated Oct 25, 2024
    Cite
    Sanja Stevanović; Husain Dashti; Marko Milošević; Salem Al-Yakoob; Dragan Stevanović (2024). Surrogate models used in the study. [Dataset]. http://doi.org/10.1371/journal.pone.0312573.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 25, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sanja Stevanović; Husain Dashti; Marko Milošević; Salem Al-Yakoob; Dragan Stevanović
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Surrogate optimisation holds great promise for building energy optimisation studies because it aims to replace lengthy building energy simulations within an optimisation step with expendable local surrogate models that can quickly predict simulation results. To be useful for this purpose, it should be possible to train precise surrogate models quickly from a small number of simulation results (10-100) obtained from appropriately sampled points in the desired part of the design space. Two sampling methods and two machine learning models are compared here. Latin hypercube sampling (LHS), widely accepted in the building energy community, is compared to an exploratory Monte Carlo-based sequential design method, mc-intersite-proj-th (MIPT). Artificial neural networks (ANN), also widely accepted in the building energy community, are compared to gradient-boosted tree ensembles (XGBoost), the model of choice in many machine learning competitions. To better understand the behaviour of these two sampling methods and two machine learning models, we compare their predictions against a large set of generated synthetic data. For this purpose, a simple case study of an office cell model with a single window and a fixed overhang, whose main input parameters are overhang depth and height, with climate type, presence of obstacles, orientation, and heating and cooling set points as additional input parameters, was extensively simulated with EnergyPlus to form a large underlying dataset of 729,000 simulation results. Expendable local surrogate models for predicting the simulated heating, cooling, and lighting loads and equivalent primary energy needs of the office cell were trained using both LHS and MIPT and both ANN and XGBoost for several main hyperparameter choices. Results show that XGBoost models are more precise than ANN models and that, for both machine learning models, MIPT sampling leads to more precise surrogates than LHS.
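
    Neither the simulator nor the training code is included here; below is a hedged sketch of the LHS/XGBoost arm of the comparison, with a stand-in function in place of an EnergyPlus run and illustrative bounds for overhang depth and height:

    # Hedged sketch: Latin hypercube design + XGBoost surrogate.
    import numpy as np
    from scipy.stats import qmc              # SciPy >= 1.7
    from xgboost import XGBRegressor

    def run_simulation(depth, height):
        """Stand-in for an EnergyPlus run returning a load value."""
        return 50 - 8 * depth + 3 * height + np.random.normal(0, 0.1)

    sampler = qmc.LatinHypercube(d=2, seed=0)
    unit = sampler.random(n=60)                          # 10-100 points, as above
    pts = qmc.scale(unit, [0.0, 0.0], [2.0, 1.5])        # depth, height bounds (m)

    y = np.array([run_simulation(d, h) for d, h in pts])
    surrogate = XGBRegressor(n_estimators=300, max_depth=4).fit(pts, y)
    print(surrogate.predict(pts[:3]))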

  10. The pseudo-code for the XGBoost-Shap.

    • plos.figshare.com
    xls
    Updated Sep 5, 2024
    Cite
    Sixuan Wang; Bin Luo (2024). The pseudo-code for the XGBoost-Shap. [Dataset]. http://doi.org/10.1371/journal.pone.0309838.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sixuan Wang; Bin Luo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Student academic achievement is an important indicator for evaluating educational quality; in particular, achievement prediction empowers educators to tailor their instructional approaches, fostering advances in both student performance and overall educational quality. However, extracting valuable insights from vast educational data to develop effective strategies for evaluating student performance remains a significant challenge for higher education institutions. Traditional machine learning (ML) algorithms often struggle to clearly delineate the interplay between the factors that influence academic success and the resulting grades. To address these challenges, this paper introduces the XGB-SHAP model, a novel approach for predicting student achievement that combines Extreme Gradient Boosting (XGBoost) with SHapley Additive exPlanations (SHAP). The model was applied to a dataset from a public university in Wuhan, comprising the academic records of 87 students enrolled in a Japanese course between September 2021 and June 2023. The findings indicate that the model excels in accuracy, achieving a mean absolute error (MAE) of approximately 6 and an R-squared value near 0.82, surpassing three other ML models. The model further uncovers how different instructional modes influence the factors that contribute to student achievement, supporting the need for feature selection customized to the characteristics of each teaching mode. Furthermore, the model highlights the importance of incorporating self-directed learning skills into student-related indicators when predicting academic performance.
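
    The XGB-SHAP pattern is straightforward to reproduce in outline; a hedged sketch on placeholder data (the real study uses 87 student records and its own indicators):

    # Hedged sketch: XGBoost + SHAP (TreeExplainer) as described above.
    import numpy as np
    import shap
    from xgboost import XGBRegressor

    X = np.random.rand(87, 10)               # 87 students, 10 stand-in indicators
    y = X @ np.random.rand(10) * 100         # stand-in course grades

    model = XGBRegressor(n_estimators=200).fit(X, y)
    explainer = shap.TreeExplainer(model)    # efficient for tree ensembles
    shap_values = explainer.shap_values(X)
    shap.summary_plot(shap_values, X)        # global feature-importance view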

  11. p-value of the independent t-test comparing the performance of XGBTree with other models using a 95% confidence interval

    • plos.figshare.com
    xls
    Updated Jun 15, 2023
    Cite
    Tuan Tran; Uyen Le; Yihui Shi (2023). p-value of the independent t-test comparing the performance of XGBTree with other models using a 95% confidence interval (Note: (*) implies the p-value is much smaller than 0.001). [Dataset]. http://doi.org/10.1371/journal.pone.0269135.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Tuan Tran; Uyen Le; Yihui Shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    p-value of the independent t-test comparing the performance of XGBTree with other models using a 95% confidence interval (Note: (*) implies the p-value is much smaller than 0.001).
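
    For readers reproducing such a table, an independent two-sample t-test over per-fold scores is a one-liner with SciPy (the score vectors below are illustrative, not the study's):

    # Hedged sketch: independent t-test comparing two models' CV scores.
    from scipy.stats import ttest_ind

    xgbtree_scores = [0.91, 0.93, 0.92, 0.94, 0.92]      # illustrative folds
    other_scores = [0.88, 0.89, 0.87, 0.90, 0.88]

    t_stat, p_value = ttest_ind(xgbtree_scores, other_scores)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")        # p < 0.05: significant at 95%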

  12. Properties of the office cell building model in various climates.

    • plos.figshare.com
    xls
    Updated Oct 25, 2024
    Cite
    Sanja Stevanović; Husain Dashti; Marko Milošević; Salem Al-Yakoob; Dragan Stevanović (2024). Properties of the office cell building model in various climates. [Dataset]. http://doi.org/10.1371/journal.pone.0312573.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 25, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sanja Stevanović; Husain Dashti; Marko Milošević; Salem Al-Yakoob; Dragan Stevanović
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Properties of the office cell building model in various climates.

  13. Data_Sheet_1_The effect of reading engagement on scientific literacy – an analysis based on the XGBoost method.docx

    • frontiersin.figshare.com
    docx
    Updated Feb 14, 2024
    Cite
    Canxi Cao; Tongxin Zhang; Tao Xin (2024). Data_Sheet_1_The effect of reading engagement on scientific literacy – an analysis based on the XGBoost method.docx [Dataset]. http://doi.org/10.3389/fpsyg.2024.1329724.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Frontiers
    Authors
    Canxi Cao; Tongxin Zhang; Tao Xin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scientific literacy is a key factor in personal competitiveness, and reading is the most common activity in daily learning; leveraging reading's day-to-day influence on individuals is therefore one of the most accessible ways to raise scientific literacy across the population. Reading engagement is an important student characteristic related to reading literacy. It is highly malleable and is jointly reflected in behavioral, cognitive, and affective engagement, so exploring the relationship between reading engagement and scientific literacy is of both theoretical and practical significance. In this study, we used PISA 2018 data from China to explore this relationship with a sample of 15-year-old students in mainland China. Thirty-six variables related to reading engagement, along with background variables (gender, grade, and family socioeconomic and cultural status), were selected from the questionnaire as independent variables, the score on the Scientific Literacy Assessment (SLA) was taken as the outcome variable, and a supervised machine learning method, the XGBoost algorithm, was used to construct the model. The dataset was randomly divided into training and test sets to optimize the model, verifying that the resulting model has good fit and generalization ability. Global and local personalized interpretation was provided by introducing SHAP values, a cutting-edge model-interpretation method. Among the three major components of reading engagement, cognitive engagement proved the more influential factor: students with high cognitive engagement in reading are more likely to score well on the scientific literacy assessment, and this component is relatively dominant in the model. The study also verifies the feasibility of a currently popular machine learning model, XGBoost, in a large-scale international education assessment program, with good model adaptability and support for global and local interpretation.

  14. Optimal hyperparameters of the models.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Tuan Tran; Uyen Le; Yihui Shi (2023). Optimal hyperparameters of the models. [Dataset]. http://doi.org/10.1371/journal.pone.0269135.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Tuan Tran; Uyen Le; Yihui Shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Optimal hyperparameters of the models.

  15. The city crash data feature and variable.

    • plos.figshare.com
    xls
    Updated May 6, 2024
    Cite
    Abdulaziz H. Alshehri; Fayez Alanazi; Ahmed. M. Yosri; Muhammad Yasir (2024). The city crash data feature and variable. [Dataset]. http://doi.org/10.1371/journal.pone.0302171.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    May 6, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Abdulaziz H. Alshehri; Fayez Alanazi; Ahmed. M. Yosri; Muhammad Yasir
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study uses machine learning methods to examine the causative factors of significant crashes, focusing on accident type and driver age. A wide-ranging dataset from Jeddah city is employed to look into factors such as the driver's sex, the vehicle's location, and the prevailing weather conditions, and to compare the efficiency of four machine learning algorithms: XGBoost, CatBoost, LightGBM, and Random Forest. The results show that the XGBoost model (95.4% accuracy), the CatBoost model (94% accuracy), and the LightGBM model (94.9% accuracy) were all superior to the Random Forest model (89.1% accuracy); of the four, XGBoost achieved the highest accuracy. These differences between models illustrate the need for careful comparative analysis when assessing vehicle accidents. Machine learning is a transformative tool for traffic safety analysis, providing vital guidance for developing accurate traffic safety regulations.
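
    The study's code is not part of this record; a hedged sketch of the four-way comparison on placeholder data (each library is a separate package):

    # Hedged sketch: accuracy comparison of the four models named above.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from catboost import CatBoostClassifier
    from lightgbm import LGBMClassifier
    from xgboost import XGBClassifier

    X = np.random.rand(2000, 10)             # stand-in crash features
    y = np.random.randint(0, 2, 2000)        # stand-in severity label
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    models = {
        "XGBoost": XGBClassifier(),
        "CatBoost": CatBoostClassifier(verbose=0),
        "LightGBM": LGBMClassifier(),
        "RandomForest": RandomForestClassifier(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, model.score(X_te, y_te))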

  16. How I Built a Smart Energy Forecast Using XGBoost

    • kaggle.com
    zip
    Updated Oct 26, 2025
    Cite
    Andrii Siryi (2025). How I Built a Smart Energy Forecast Using XGBoost [Dataset]. https://www.kaggle.com/datasets/asiryi/how-i-built-a-smart-energy-forecast-using-xgboost
    Explore at:
    Available download formats: zip (771443 bytes)
    Dataset updated
    Oct 26, 2025
    Authors
    Andrii Siryi
    Description

    Over the last few days, I’ve been experimenting with an idea I’ve wanted to build for a long time — a small, intelligent energy-forecasting system that learns from IoT data and predicts electricity consumption in real time.

    The goal was simple: teach a model to understand how a household’s energy use changes with time, activity, and weather — and then visualize what the next 24 hours might look like.

    Here’s what the journey looked like:

    Step 1 Simulating Real IoT and Weather Data

    To start, I created realistic datasets for both IoT sensors and weather conditions.

    simulate_iot.py generated hourly energy readings (kWh) based on typical daily patterns — more usage in the evenings, less at night.

    simulate_weather.py produced temperature, humidity, and precipitation data for the same 60-day period.

    These two datasets became the foundation of the system — one describing human activity, the other representing environmental influence.

    Step 2 Feature Engineering

    The next piece was features.py, which merged both datasets into a single training set. Here the goal was to create features that the model could actually learn from:

    Lag features (kwh_lag_1, kwh_lag_24) to capture short-term and daily patterns.

    Rolling averages to smooth out fluctuations.

    Weather fields (outside_temp, humidity) to model environmental impact.

    This step is where raw data turns into usable intelligence.

    Step 3 Training the Model

    Using train.py, I trained an XGBoost regression model on 60 days of data. The model learned to predict energy usage for each hour based on:

    • time of day,
    • day of week,
    • weather conditions,
    • number of occupants,
    • HVAC activity,
    • and the previous energy history.

    After training, the model’s performance looked solid — MAE ≈ 0.07, RMSE ≈ 0.09, and MAPE around 10-15%. Pretty good for a simulated environment!
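
    For reference, the three metrics quoted above can be computed as follows (a hedged sketch; the arrays are placeholders):

    # Hedged sketch: MAE, RMSE, and MAPE for hourly kWh predictions.
    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    y_true = np.array([0.52, 0.61, 0.48, 0.75])          # actual kWh
    y_pred = np.array([0.50, 0.66, 0.45, 0.80])          # model output

    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

    print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  MAPE={mape:.1f}%")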

    Step 4 Forecasting and Visualization

    Once the model was trained, I moved to the fun part: visualizing the predictions.

    Using Plotly, I built forecast_plotly.py, which generates an interactive dashboard. It displays two parts:

    • The real energy consumption for the last 48 hours (blue line).
    • The forecasted energy usage for the next 24 hours (orange dashed line).

    A gray vertical line separates “past” from “future”, making the forecast transition crystal clear. You can zoom in, hover over points to see values, and even export the chart as HTML.

    [Screenshots: Figure_1.jpg, Figure_2.jpg, Figure_3.jpg (interactive forecast dashboard views)]

    Project Structure

    The project is organized cleanly to keep everything modular and easy to maintain:

    D:\AI Models\energy-saver
    ├── data
    │   ├── iot_simulated.csv          # Simulated IoT energy readings
    │   ├── weather.csv                # Simulated weather data
    │   └── train_dataset.csv          # Merged dataset used for training
    ├── models
    │   └── xgb_kwh.joblib             # Trained XGBoost model
    ├── src
    │   ├── simulate_iot.py            # IoT data generator
    │   ├── simulate_weather.py        # Weather data generator
    │   ├── features.py                # Merging and feature creation
    │   ├── train.py                   # Model training and evaluation
    │   └── forecast_plotly.py         # Interactive visualization (Plotly)
    ├── venv\                          # Virtual environment
    ├── README.md                      # Project documentation
    └── forecast_interactive.html     # Saved interactive dashboard

    The final result is a small yet complete prototype of a smart energy management system. With a few adjustments (real IoT data, a weather API, and live retraining), this same setup could power a real “AI-based home energy advisor.”

    It doesn’t just predict — it can help decide when it’s cheaper or smarter to use energy, saving both cost and power.

    Reflection

    This project turned out to be an amazing hands-on way to combine data simulation, feature engineering, model training, and visualization in one workflow.

    Every part of it has a clear role:

    • simulate_iot.py → creates realistic energy signals
    • simulate_weather.py → adds environmental context
    • features.py → merges and engineers predictors
    • train.py → builds and evaluates the model
    • forecast_plotly.py → brings everything to life visually
  17. Drug Response Prediction Dataset

    • kaggle.com
    zip
    Updated Dec 16, 2023
    Cite
    Vahid Kazemian (2023). Drug Response Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/vahidkazemian/introds
    Explore at:
    Available download formats: zip (2564 bytes)
    Dataset updated
    Dec 16, 2023
    Authors
    Vahid Kazemian
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains clinical and demographic patient data used to predict the most suitable drug prescription (multiclass classification task). It's designed for developing machine learning models that assist in personalized medicine.

    Origin & Purpose:
    • Source: Synthetic/benchmark dataset (commonly used in ML courses)
    • Goal: Predict one of 5 drugs (A, B, C, X, or Y) based on patient metrics
    • Size: 200 patient records

    Notable Characteristics:
    • Class imbalance: drugY = 39.5% (most frequent); drugA/B/C = 10-11% each
    • Clinical relevance: blood pressure (BP) and cholesterol levels heavily influence drug choice; electrolytes (Na/K) show non-linear relationships with outcomes

    Use Cases:
    • Multiclass classification practice
    • Feature importance analysis (e.g., "Does age or BP matter more?")
    • Medical decision-support prototyping

    Sample Insight: Patients with LOW BP and HIGH cholesterol are often prescribed drugC, while those with NORMAL vitals typically receive drugX or drugY.

    Ideal For: Logistic Regression, Random Forests, Gradient Boosting (XGBoost/CatBoost), and neural networks for tabular data. A baseline sketch follows below.
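
    A hedged multiclass baseline for this task; the file and column names (e.g., "drug200.csv", Drug) follow the classic 200-record drug dataset this resembles and are assumptions here:

    # Hedged sketch: 5-class drug prediction with XGBoost.
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBClassifier

    df = pd.read_csv("drug200.csv")                      # assumed file name
    y = LabelEncoder().fit_transform(df["Drug"])         # drugA..drugY -> 0..4
    X = pd.get_dummies(df.drop(columns=["Drug"]))        # one-hot categoricals

    model = XGBClassifier(objective="multi:softprob")
    model.fit(X, y)
    print(model.predict_proba(X.iloc[:3]))               # per-drug probabilities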

  18. Fake News Detection Dataset

    • kaggle.com
    zip
    Updated Apr 27, 2025
    Cite
    Mahdi Mashayekhi (2025). Fake News Detection Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/fake-news-detection-dataset
    Explore at:
    Available download formats: zip (11735585 bytes)
    Dataset updated
    Apr 27, 2025
    Authors
    Mahdi Mashayekhi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📚 Fake News Detection Dataset

    Overview

    This dataset is designed for practicing fake news detection using machine learning and natural language processing (NLP) techniques. It includes a rich collection of 20,000 news articles, carefully generated to simulate real-world data scenarios. Each record contains metadata about the article and a label indicating whether the news is real or fake.

    The dataset also intentionally includes around 5% missing values in some fields to simulate the challenges of handling incomplete data in real-life projects.

    Columns Description

    • title: A short headline summarizing the article (around 6 words).
    • text: The body of the news article (200–300 words on average).
    • date: The publication date of the article, randomly selected over the past 3 years.
    • source: The media source that published the article (e.g., BBC, CNN, Al Jazeera). May contain missing values (~5%).
    • author: The author's full name. Some entries are missing (~5%) to simulate real-world incomplete data.
    • category: The general category of the article (e.g., Politics, Health, Sports, Technology).
    • label: The target label: real or fake news.

    Why Use This Dataset?

    Fake News Detection Practice: Perfect for binary classification tasks.

    NLP Preprocessing: Allows users to practice text cleaning, tokenization, vectorization, etc.

    Handling Missing Data: Some fields are incomplete to simulate real-world data challenges.

    Feature Engineering: Encourages creating new features from text and metadata.

    Balanced Labels: Realistic distribution of real and fake news for fair model training.

    Potential Use Cases

    Building and evaluating text classification models (e.g., Logistic Regression, Random Forests, XGBoost).

    Practicing NLP techniques like TF-IDF, Word2Vec, BERT embeddings.

    Performing exploratory data analysis (EDA) on news data.

    Developing pipelines for dealing with missing values and feature extraction.
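
    A hedged sketch of the first use case, combining TF-IDF features with XGBoost (the label values "real"/"fake" are assumed from the column description; file name per File Info below):

    # Hedged sketch: TF-IDF + XGBoost for the real/fake label.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    df = pd.read_csv("fake_news_dataset.csv").dropna(subset=["text"])
    y = (df["label"] == "fake").astype(int)              # label values assumed

    vec = TfidfVectorizer(max_features=20000, stop_words="english")
    X = vec.fit_transform(df["text"])                    # sparse term matrix
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = XGBClassifier(n_estimators=300).fit(X_tr, y_tr)
    print("Test accuracy:", clf.score(X_te, y_te))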

    A Note on Data Quality

    This dataset has been synthetically generated to closely resemble real news articles. The diversity in titles, text, sources, and categories ensures that models trained on this dataset can generalize well to unseen, real-world data. However, since it is synthetic, it should not be used for production models or decision-making without careful validation.

    File Info

    Filename: fake_news_dataset.csv

    Size: 20,000 rows × 7 columns

    Missing Data: ~5% missing values in the source and author columns.

  19. Data from: Strategy of Coupling Artificial Intelligence with Thermodynamic Mechanism for Predicting Complex Polymer Viscosities

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Mar 6, 2024
    Cite
    Siqi Wang; Gabriele Sadowski; Yuanhui Ji (2024). Strategy of Coupling Artificial Intelligence with Thermodynamic Mechanism for Predicting Complex Polymer Viscosities [Dataset]. http://doi.org/10.1021/acssuschemeng.3c08185.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    ACS Publications
    Authors
    Siqi Wang; Gabriele Sadowski; Yuanhui Ji
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    With the environmental-protection requirements brought about by the large-scale industrial application of polymers, understanding polymer viscosities is becoming increasingly important. The different chain arrangements and degrees of crystallinity of polymers make their viscosities difficult to calculate. To address this challenge, new strategies based on artificial intelligence algorithms are proposed. First, three algorithms [extreme gradient boosting (XGBoost), a convolutional neural network (CNN), and a multilayer perceptron (MLP)] are trained on molecular descriptors of polymer molecular properties. Next, PC-SAFT parameters representing the polymers' thermodynamic properties are supplied to the XGBoost and CNN algorithms as additional molecular descriptors to improve prediction accuracy. Subsequently, Molecular ACCess System (MACCS) chemical fingerprints are combined with the XGBoost and CNN algorithms to further improve viscosity prediction. XGBoost was identified as the best algorithm for predicting polymer viscosities in different states. This finding is expected to provide useful guidance for screening polymers for applications in medicine and the chemical industry.
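
    The paper's code is not part of this record; below is a hedged sketch of the MACCS-fingerprint arm of the strategy, with placeholder monomer SMILES and viscosities (the PC-SAFT descriptors the paper also feeds in are omitted):

    # Hedged sketch: MACCS fingerprints as XGBoost inputs (requires RDKit).
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import MACCSkeys
    from xgboost import XGBRegressor

    smiles = ["C=C", "C=CC=C", "CC(=O)OC=C"]             # placeholder monomers
    viscosity = np.array([1.2, 3.4, 2.1])                # placeholder targets

    fps = np.array([list(MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)))
                    for s in smiles])                    # 167-bit MACCS keys

    model = XGBRegressor(n_estimators=100).fit(fps, viscosity)
    print(model.predict(fps))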

  20. Sustainable Luxury Consumer Survey

    • kaggle.com
    zip
    Updated Nov 6, 2025
    Cite
    Saima Khan6 (2025). Sustainable Luxury Consumer Survey [Dataset]. https://www.kaggle.com/datasets/saimakhan6/sustainable-luxury-consumer-survey
    Explore at:
    Available download formats: zip (102395 bytes)
    Dataset updated
    Nov 6, 2025
    Authors
    Saima Khan6
    Description

    📖 Description

    This dataset was designed to explore how psychological traits, attitudinal measures, and motivational drivers influence sustainable luxury purchase intention. It combines quantitative responses from 500 consumers on multiple validated scales, including sustainability attitudes, purchase intention, personality (Big Five), and motivational factors.

    The dataset was collected as part of an academic-industry research project on sustainable luxury consumption and consumer psychology. It aims to bridge the gap between marketing theory and predictive analytics by providing a structured, research-grade dataset suitable for both statistical and machine learning modeling.

    🎯 Business & Research Use Cases
    • Predict purchase intention for sustainable luxury products.
    • Segment consumers based on eco-conscious attitudes and personality traits.
    • Build marketing analytics models that link sustainability values with buying behavior.
    • Use for teaching and demonstration in data-driven marketing, consumer analytics, or ethical branding.
