Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In building structures, lateral load-bearing capacity depends mainly on reinforced concrete (RC) walls. Determining flexural strength and shear strength is mandatory when designing RC walls. Typically, these strengths are determined through theoretical formulas and verified experimentally. However, theoretical formulas often have large errors, and testing is costly and time-consuming. Therefore, this study exploits machine learning techniques, specifically a hybrid XGBoost model combined with optimization algorithms, to predict the shear strength of RC walls by training on available experimental results. The study used the largest database of RC walls to date, consisting of 1057 samples with various cross-sectional shapes. Bayesian optimization (BO) algorithms, namely BO-Gaussian Process and BO-Random Forest, together with Random Search, were used to refine the XGBoost model architecture. The results show that the Gaussian Process emerged as the most efficient optimization algorithm, yielding the lowest mean squared error and prediction R² values of 0.998 for the training set, 0.972 for the validation set, and 0.984 for the test set. BO-Random Forest and Random Search matched the Gaussian Process on the training and test sets but performed noticeably worse on the validation set, with validation R² of 0.970 and 0.969 respectively, over the entire dataset covering all cross-sectional shapes of RC wall. The SHAP (SHapley Additive exPlanations) technique was used to clarify the predictive ability of the model and the importance of the input variables. Furthermore, the performance of the model was validated through comparative analysis with benchmark models and current standards. Notably, the coefficient of variation (COV %) of the XGBoost model is 13.27%, while traditional models often have COV % exceeding 50%.
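As a concrete illustration of the tuning loop described above, here is a minimal sketch of Gaussian-process Bayesian optimization of XGBoost hyperparameters using scikit-optimize; the search space, placeholder data, and parameter ranges are assumptions for illustration, not the study's actual settings.

```python
# Minimal sketch: GP-based Bayesian optimization of an XGBoost regressor.
# Placeholder data stands in for the RC-wall feature matrix and shear-strength target.
import numpy as np
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(42)
X, y = rng.random((200, 8)), rng.random(200)   # placeholder features / shear strengths

space = [
    Integer(100, 1000, name="n_estimators"),
    Integer(2, 10, name="max_depth"),
    Real(0.01, 0.3, prior="log-uniform", name="learning_rate"),
    Real(0.5, 1.0, name="subsample"),
]

@use_named_args(space)
def objective(**params):
    model = XGBRegressor(objective="reg:squarederror", **params)
    # Minimize cross-validated MSE (negate sklearn's neg-MSE score).
    return -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()

result = gp_minimize(objective, space, n_calls=30, random_state=42)
print("Best CV MSE:", result.fun)
print("Best hyperparameters:", result.x)
```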
Background: Advances in Next Generation Sequencing have made rapid variant discovery and detection widely accessible. To facilitate a better understanding of the nature of these variants, the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG-AMP) have issued a set of guidelines for variant classification. However, given the vast number of variants associated with any disorder, it is impossible to manually apply these guidelines to all known variants. Machine learning methodologies offer a rapid way to classify large numbers of variants, including variants of uncertain significance, as either pathogenic or benign. Here we classify ATP7B genetic variants by employing ML and AI algorithms trained on our well-annotated WilsonGen dataset.
Methods: We trained and validated two algorithms, TabNet and XGBoost, on a high-confidence dataset of manually annotated, ACMG & AMP classified variants of the ATP7B gene associated with Wilson's disease.
Results: Using an independent validation dataset of ACMG & AMP classified variants, as well as a patient set of functionally validated variants, we show how both algorithms perform and can be used to classify large numbers of variants in clinical as well as research settings.
Conclusion: We have created a ready-to-deploy tool that can classify variants linked with Wilson's disease as pathogenic or benign, and that can be utilized by both clinicians and researchers to better understand the disease through the nature of the genetic variants associated with it.
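For readers who want to reproduce the general workflow, below is a minimal sketch of the XGBoost half of such a variant classifier, assuming annotated variants have already been encoded as numeric features; the feature names and placeholder labels are illustrative and do not reflect the WilsonGen schema.

```python
# Minimal sketch: binary pathogenic/benign classification with XGBoost on
# synthetic stand-in features (illustrative names, not the WilsonGen columns).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "allele_frequency": rng.random(300),           # population allele frequency
    "conservation_score": rng.normal(size=300),    # e.g. a PhyloP-like score
    "in_functional_domain": rng.integers(0, 2, 300),
})
y = (X["conservation_score"] > 0).astype(int)      # placeholder: 1 = pathogenic, 0 = benign

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, eval_metric="logloss")
clf.fit(X_tr, y_tr)
print(classification_report(y_val, clf.predict(X_val)))
```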
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Rapidly acquiring three-dimensional (3D) building data, including geometric attributes such as rooftop, height, and orientation, as well as indicative attributes such as function, quality, and age, is essential for accurate urban analysis, simulation, and policy updates. Current building datasets suffer from incomplete coverage of building multi-attributes. This paper presents the first national-scale Multi-Attribute Building dataset (CMAB) built with artificial intelligence, covering 3,667 natural cities, 31 million buildings, and 23.6 billion m² of rooftops extracted by OCRNet with an F1-score of 89.93%, totaling 363 billion m³ of building stock. We trained bootstrap-aggregated XGBoost models with city administrative classifications, incorporating morphology, location, and function features. Using multi-source data, including billions of remote sensing images and 60 million street view images (SVIs), we generated rooftop, height, structure, function, style, age, and quality attributes for each building with machine learning and large multimodal models. Accuracy was validated through model benchmarks, comparison with existing similar products, and manual SVI validation, with most attributes above 80%. The dataset and results are valuable for global SDGs and urban planning.
Data records: A building dataset with a total rooftop area of 23.6 billion square meters across 3,667 natural cities in China, including building rooftop, height, structure, function, age, style, and quality attributes, together with the code files used to calculate these data. The deep learning models used are OCRNet, XGBoost, fine-tuned CLIP, and YOLOv8.
Supplementary note: The architectural structure, style, and quality attributes are affected by the temporal and spatial distribution of street views in China. For building colors, we found that existing CLIP-series models cannot accurately judge the composition and proportion of colors, so colors are instead calculated via semantic segmentation and image processing. Please contact zhangyec23@mails.tsinghua.edu.cn or ylong@tsinghua.edu.cn if you have any technical problems.
Reference format: Zhang, Y., Zhao, H. & Long, Y. CMAB: A Multi-Attribute Building Dataset of China. Sci Data 12, 430 (2025). https://doi.org/10.1038/s41597-025-04730-5.
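The bootstrap-aggregated XGBoost idea mentioned above can be sketched with scikit-learn's bagging wrapper as follows; the features, target, and parameter values are placeholders rather than the CMAB pipeline itself.

```python
# Minimal sketch: bagged XGBoost regressors (bootstrap aggregation) on
# placeholder building-level features and a placeholder height target.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 12))                                # placeholder morphology/location/function features
y = 10 + 40 * X[:, 0] + rng.normal(scale=2, size=500)    # placeholder building heights (m)

bagged = BaggingRegressor(
    estimator=XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1),
    n_estimators=10,          # 10 bootstrap resamples, one XGBoost model each
    max_samples=0.8,          # each model sees a resampled 80% of the data
    random_state=42,
)  # note: scikit-learn < 1.2 uses base_estimator= instead of estimator=
bagged.fit(X, y)
print("Predicted heights (m):", bagged.predict(X[:5]).round(1))
```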
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that generate the most accurate predictions without being overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions that are, on average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed: whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
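A minimal sketch of the kind of single-CPU, fixed-parameter XGBoost run described above is shown below; the parameter values and placeholder descriptor matrix are illustrative, not the paper's standard set or in-house data.

```python
# Minimal sketch: one fixed ("standard") XGBoost parameter set applied to a
# QSAR-style regression task, restricted to a single CPU.
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 300))                                   # placeholder molecular descriptors
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=1000)   # placeholder activity values

standard_xgb = XGBRegressor(
    n_estimators=500, max_depth=6, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8,
    n_jobs=1,                                # single CPU, as in the speed comparison
)
r2_scores = cross_val_score(standard_xgb, X, y, cv=5, scoring="r2")
print(f"Mean cross-validated R^2: {r2_scores.mean():.3f}")
```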
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains five years of historical data to help predict electricity demand using machine learning, especially with models like XGBoost. It includes features such as temperature, humidity, wind speed, and past electricity usage across different time intervals.
The dataset is designed to help you learn and build models that can forecast how much electricity people might use in the future. This is useful for energy companies, smart grids, and power management systems.
The Features/Columns available in the dataset are:
Potential Use Cases:
- Build regression models to forecast electricity demand (a minimal sketch follows this list)
- Use lag and rolling features in time series models
- Compare the performance of ML algorithms like XGBoost, Random Forest, and LSTM
- Learn how environmental and time-based factors affect electricity usage
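As referenced above, here is a minimal sketch of lag and rolling features feeding an XGBoost demand forecaster; the column names and synthetic data are assumptions about this dataset's schema rather than its actual contents.

```python
# Minimal sketch: lag/rolling features + XGBoost for hourly demand forecasting
# on synthetic stand-in data (column names are illustrative).
import numpy as np
import pandas as pd
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
hours = pd.date_range("2020-01-01", periods=24 * 365, freq="h")
df = pd.DataFrame({
    "timestamp": hours,
    "temperature": 20 + 10 * np.sin(np.arange(len(hours)) * 2 * np.pi / 24),
    "humidity": rng.uniform(30, 90, len(hours)),
    "wind_speed": rng.uniform(0, 15, len(hours)),
})
df["electricity_demand"] = 100 + 2 * df["temperature"] + rng.normal(0, 5, len(hours))

# Lag and rolling features capture short-term and daily patterns
df["demand_lag_1"] = df["electricity_demand"].shift(1)
df["demand_lag_24"] = df["electricity_demand"].shift(24)
df["demand_roll_24h"] = df["electricity_demand"].rolling(24).mean()
df = df.dropna()

features = ["temperature", "humidity", "wind_speed",
            "demand_lag_1", "demand_lag_24", "demand_roll_24h"]
train, test = df.iloc[:-24 * 7], df.iloc[-24 * 7:]   # hold out the last week
model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(train[features], train["electricity_demand"])
print(f"Test R^2: {model.score(test[features], test['electricity_demand']):.3f}")
```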
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: An easily accessible, cost-free machine learning model based on prior probabilities of vascular aging makes it possible to pinpoint high-risk populations before physical checks and to optimize healthcare investment.
Methods: A dataset containing questionnaire responses and physical measurement parameters from 77,134 adults was extracted from the electronic records of the Health Management Center at the Third Xiangya Hospital. The least absolute shrinkage and selection operator (LASSO) and recursive feature elimination with LightGBM were employed to select features from a pool of potential covariates. The participants were randomly divided into training (70%) and test (30%) cohorts. Four machine learning algorithms were applied to build screening models for elevated arterial stiffness (EAS), and model performance was evaluated by the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy.
Results: Fourteen easily accessible features were selected to construct the model, including "systolic blood pressure" (SBP), "age," "waist circumference," "history of hypertension," "sex," "exercise," "awareness of normal blood pressure," "eat fruit," "work intensity," "drink milk," "eat bean products," "smoking," "alcohol consumption," and "irritability." The extreme gradient boosting (XGBoost) model outperformed the other three models, achieving AUC values of 0.8722 and 0.8710 in the training and test sets, respectively. The five most important features are SBP, age, waist circumference, history of hypertension, and sex.
Conclusion: The XGBoost model is well suited to assessing the prior probability of current EAS in the general population. Integrating the model into primary care facilities has the potential to lower medical expenses and enhance the management of arterial aging.
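A minimal sketch of the screening setup described above (70/30 split, XGBoost classifier, AUC together with sensitivity and specificity) is given below; the feature names and synthetic labels are illustrative stand-ins for the study's fourteen selected items.

```python
# Minimal sketch: XGBoost screening model evaluated with AUC, sensitivity,
# and specificity on a 70/30 split of synthetic stand-in data.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "systolic_blood_pressure": rng.normal(125, 15, n),
    "age": rng.integers(20, 80, n),
    "waist_circumference": rng.normal(85, 10, n),
    "history_of_hypertension": rng.integers(0, 2, n),
    "sex": rng.integers(0, 2, n),
})
y = ((X["systolic_blood_pressure"] > 135) | (X["age"] > 60)).astype(int)  # placeholder EAS label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
clf = XGBClassifier(n_estimators=400, max_depth=4, learning_rate=0.05, eval_metric="logloss")
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
tn, fp, fn, tp = confusion_matrix(y_te, proba > 0.5).ravel()
print("AUC:", round(roc_auc_score(y_te, proba), 4))
print("Sensitivity:", round(tp / (tp + fn), 3), "Specificity:", round(tn / (tn + fp), 3))
```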
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM (a minimal training sketch follows the column table below).
| Column Name | Description |
|---|---|
| Transaction_ID | Unique identifier for each transaction |
| User_ID | Unique identifier for the user |
| Transaction_Amount | Amount of money involved in the transaction |
| Transaction_Type | Type of transaction (Online, In-Store, ATM, etc.) |
| Timestamp | Date and time of the transaction |
| Account_Balance | User's current account balance before the transaction |
| Device_Type | Type of device used (Mobile, Desktop, etc.) |
| Location | Geographical location of the transaction |
| Merchant_Category | Type of merchant (Retail, Food, Travel, etc.) |
| IP_Address_Flag | Whether the IP address was flagged as suspicious (0 or 1) |
| Previous_Fraudulent_Activity | Number of past fraudulent activities by the user |
| Daily_Transaction_Count | Number of transactions made by the user that day |
| Avg_Transaction_Amount_7d | User's average transaction amount in the past 7 days |
| Failed_Transaction_Count_7d | Count of failed transactions in the past 7 days |
| Card_Type | Type of payment card used (Credit, Debit, Prepaid, etc.) |
| Card_Age | Age of the card in months |
| Transaction_Distance | Distance between the user's usual location and transaction location |
| Authentication_Method | How the user authenticated (PIN, Biometric, etc.) |
| Risk_Score | Fraud risk score computed for the transaction |
| Is_Weekend | Whether the transaction occurred on a weekend (0 or 1) |
| Fraud_Label | Target variable (0 = Not Fraud, 1 = Fraud) |
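As noted above, a minimal training sketch on these columns might look like the following; the CSV filename is a placeholder and the hyperparameters are illustrative rather than tuned.

```python
# Minimal sketch: binary fraud classifier on the columns listed above.
# "fraud_transactions.csv" is a placeholder; use the actual file from the dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("fraud_transactions.csv")

# Drop identifiers and one-hot encode the categorical columns from the table above.
X = df.drop(columns=["Fraud_Label", "Transaction_ID", "User_ID", "Timestamp"])
X = pd.get_dummies(X, columns=["Transaction_Type", "Device_Type", "Location",
                               "Merchant_Category", "Card_Type", "Authentication_Method"])
y = df["Fraud_Label"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
clf = XGBClassifier(
    n_estimators=400, max_depth=6, learning_rate=0.05,
    scale_pos_weight=(y_tr == 0).sum() / (y_tr == 1).sum(),  # compensate for class imbalance
    eval_metric="auc",
)
clf.fit(X_tr, y_tr)
print("ROC AUC:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 4))
```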
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Surrogate optimisation holds great promise for building energy optimisation studies because it replaces lengthy building energy simulations within an optimisation step with expendable local surrogate models that can quickly predict simulation results. To be useful for this purpose, it should be possible to quickly train precise surrogate models from a small number of simulation results (10–100) obtained from appropriately sampled points in the desired part of the design space. Two sampling methods and two machine learning models are compared here. Latin hypercube sampling (LHS), widely accepted in the building energy community, is compared to an exploratory Monte Carlo-based sequential design method, mc-intersite-proj-th (MIPT). Artificial neural networks (ANN), also widely accepted in the building energy community, are compared to gradient-boosted tree ensembles (XGBoost), the model of choice in many machine learning competitions. To better understand the behaviour of these two sampling methods and two machine learning models, we compare their predictions against a large set of generated synthetic data. For this purpose, a simple case study was simulated extensively with EnergyPlus: an office cell model with a single window and a fixed overhang, whose main input parameters are overhang depth and height, with climate type, presence of obstacles, orientation, and heating and cooling set points as additional input parameters, forming a large underlying dataset of 729,000 simulation results. Expendable local surrogate models for predicting the simulated heating, cooling and lighting loads and equivalent primary energy needs of the office cell were trained using both LHS and MIPT and both ANN and XGBoost for several main hyperparameter choices. Results show that XGBoost models are more precise than ANN models and that, for both machine learning models, MIPT sampling leads to more precise surrogates than LHS.
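A minimal sketch of the LHS-plus-XGBoost surrogate idea is shown below, with a cheap analytic function standing in for an EnergyPlus run; the design-variable ranges and the stand-in function are assumptions for illustration only.

```python
# Minimal sketch: Latin hypercube sampling of a 2D design space, a stand-in
# "simulation", and an XGBoost surrogate trained on the sampled results.
import numpy as np
from scipy.stats import qmc
from xgboost import XGBRegressor

def fake_simulation(depth, height):
    """Cheap stand-in for an EnergyPlus run returning primary energy need."""
    return 80 - 15 * np.exp(-((depth - 1.0) ** 2 + (height - 0.5) ** 2))

# Latin hypercube sample of 60 design points over (overhang depth, overhang height)
sampler = qmc.LatinHypercube(d=2, seed=0)
unit_points = sampler.random(n=60)
points = qmc.scale(unit_points, l_bounds=[0.0, 0.0], u_bounds=[2.0, 1.0])

y = np.array([fake_simulation(d, h) for d, h in points])

# Local surrogate: gradient-boosted trees trained on the sampled simulations
surrogate = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
surrogate.fit(points, y)
print("Surrogate prediction at (1.0, 0.5):", surrogate.predict(np.array([[1.0, 0.5]])))
```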
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Student academic achievement is an important indicator for evaluating the quality of education; in particular, achievement prediction empowers educators to tailor their instructional approaches, thereby fostering advancements in both student performance and overall educational quality. However, extracting valuable insights from vast educational data to develop effective strategies for evaluating student performance remains a significant challenge for higher education institutions. Traditional machine learning (ML) algorithms often struggle to clearly delineate the interplay between the factors that influence academic success and the resulting grades. To address these challenges, this paper introduces the XGB-SHAP model, a novel approach for predicting student achievement that combines Extreme Gradient Boosting (XGBoost) with SHapley Additive exPlanations (SHAP). The model was applied to a dataset from a public university in Wuhan, encompassing the academic records of 87 students who were enrolled in a Japanese course between September 2021 and June 2023. The findings indicate that the model excels in accuracy, achieving a mean absolute error (MAE) of approximately 6 and an R-squared value near 0.82, surpassing three other ML models. The model further uncovers how different instructional modes influence the factors that contribute to student achievement. This insight supports the need for a customized approach to feature selection that aligns with the specific characteristics of each teaching mode. Furthermore, the model highlights the importance of incorporating self-directed learning skills into student-related indicators when predicting academic performance.
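The XGBoost-plus-SHAP pairing described above can be sketched as follows; the feature names and synthetic data are illustrative and are not the paper's actual indicators.

```python
# Minimal sketch: XGBoost regressor + SHAP for global and local explanations
# on synthetic stand-in student data.
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "attendance_rate": rng.uniform(0.5, 1.0, 300),
    "homework_completion": rng.uniform(0, 1, 300),
    "self_directed_learning": rng.uniform(0, 1, 300),
    "quiz_average": rng.uniform(40, 100, 300),
})
y = 0.5 * X["quiz_average"] + 20 * X["self_directed_learning"] + rng.normal(0, 3, 300)

model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1).fit(X, y)

explainer = shap.TreeExplainer(model)
explanation = explainer(X)             # shap.Explanation with per-feature contributions
shap.plots.beeswarm(explanation)       # global view: which features matter most
shap.plots.waterfall(explanation[0])   # local view: one student's prediction broken down
```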
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
p-value of the independent t-test comparing the performance of XGBTree with other models using a 95% confidence interval (Note: (*) implies the p-value is much smaller than 0.001).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Properties of the office cell building model in various climates.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scientific literacy is a key factor in personal competitiveness, and reading is the most common activity in daily learning; leveraging the influence of reading on individuals is therefore one of the most accessible ways to raise scientific literacy across the population. Reading engagement is one of the important student characteristics related to reading literacy: it is highly malleable and is jointly reflected in behavioral, cognitive, and affective engagement, so exploring the relationship between reading engagement and scientific literacy is of both theoretical and practical significance. In this study, we used PISA 2018 data from China to explore the relationship between reading engagement and scientific literacy in a sample of 15-year-old students in mainland China. Thirty-six variables related to reading engagement, together with background variables (gender, grade, and the socioeconomic and cultural status of the family), were selected from the questionnaire as independent variables, the score on the Scientific Literacy Assessment (SLA) was taken as the outcome variable, and a supervised machine learning method, the XGBoost algorithm, was used to construct the model. The dataset was randomly divided into training and test sets to optimize the model and verify that the resulting model has a good fit and generalization ability. Global and local personalized interpretation was provided by introducing SHAP values, a state-of-the-art model interpretation method. We find that among the three major components of reading engagement, cognitive engagement is the most influential factor: students with high cognitive engagement in reading are more likely to obtain high scores in the scientific literacy assessment, and this component is relatively dominant in the model. The study also demonstrates the feasibility of a popular machine learning model, XGBoost, in a large-scale international education assessment program, with good model adaptability and the conditions for global and local interpretation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Optimal hyperparameters of the models.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study uses machine learning methods to examine the causative factors of significant crashes, focusing on accident type and driver age. A wide-ranging dataset from Jeddah city is employed to examine various factors, such as driver gender, vehicle location, and prevailing weather conditions, and to compare the performance of four machine learning algorithms: XGBoost, CatBoost, LightGBM, and Random Forest. The results show that the XGBoost model (95.4% accuracy), the LightGBM model (94.9% accuracy), and the CatBoost model (94% accuracy) were all superior to the Random Forest model (89.1% accuracy), with XGBoost achieving the highest accuracy overall. These subtle differences between models illustrate the need for careful analysis when assessing vehicle accidents. Machine learning is also a transformative tool in traffic safety analysis, providing vital guidance for developing accurate traffic safety regulations.
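A minimal sketch of this four-model comparison is shown below; the encoded crash features and labels are synthetic placeholders, and the hyperparameters are library defaults rather than the study's settings.

```python
# Minimal sketch: comparing XGBoost, CatBoost, LightGBM, and Random Forest
# accuracies on synthetic stand-in crash records.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
n = 3000
X = np.column_stack([
    rng.integers(0, 2, n),     # driver gender (encoded)
    rng.integers(18, 80, n),   # driver age
    rng.integers(0, 5, n),     # weather condition (encoded)
    rng.integers(0, 10, n),    # location zone (encoded)
])
y = ((X[:, 1] < 25) | (X[:, 2] > 3)).astype(int)   # placeholder "significant crash" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
models = {
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "CatBoost": CatBoostClassifier(verbose=0),
    "LightGBM": LGBMClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=300),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: accuracy = {accuracy_score(y_te, model.predict(X_te)):.3f}")
```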
Over the last few days, I’ve been experimenting with an idea I’ve wanted to build for a long time — a small, intelligent energy-forecasting system that learns from IoT data and predicts electricity consumption in real time.
The goal was simple: teach a model to understand how a household’s energy use changes with time, activity, and weather — and then visualize what the next 24 hours might look like.
Here’s what the journey looked like:
Step 1: Simulating Real IoT and Weather Data
To start, I created realistic datasets for both IoT sensors and weather conditions.
simulate_iot.py generated hourly energy readings (kWh) based on typical daily patterns — more usage in the evenings, less at night.
simulate_weather.py produced temperature, humidity, and precipitation data for the same 60-day period.
These two datasets became the foundation of the system — one describing human activity, the other representing environmental influence.
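Below is a condensed sketch of what such simulator scripts might produce (60 days of hourly kWh with an evening peak, plus matching weather); the exact generation logic in simulate_iot.py and simulate_weather.py may well differ.

```python
# Condensed sketch: generate 60 days of hourly IoT energy readings and weather.
from pathlib import Path

import numpy as np
import pandas as pd

Path("data").mkdir(exist_ok=True)
rng = np.random.default_rng(42)
hours = pd.date_range("2024-01-01", periods=60 * 24, freq="h")
hour_of_day = hours.hour.to_numpy()

# Energy: low at night, peaking around 19:00, plus noise
evening_dist = np.minimum((hour_of_day - 19) % 24, (19 - hour_of_day) % 24)
kwh = 0.3 + 0.5 * np.exp(-evening_dist ** 2 / 18.0) + rng.normal(0, 0.05, len(hours))
pd.DataFrame({"timestamp": hours, "kwh": kwh}).to_csv("data/iot_simulated.csv", index=False)

# Weather: daily temperature cycle, humidity, occasional precipitation
weather = pd.DataFrame({
    "timestamp": hours,
    "outside_temp": 10 + 8 * np.sin((hour_of_day - 6) / 24 * 2 * np.pi) + rng.normal(0, 1, len(hours)),
    "humidity": rng.uniform(40, 90, len(hours)),
    "precipitation": (rng.random(len(hours)) < 0.1) * rng.uniform(0, 5, len(hours)),
})
weather.to_csv("data/weather.csv", index=False)
```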
Step 2: Feature Engineering
The next piece was features.py, which merged both datasets into a single training set. Here the goal was to create features that the model could actually learn from:
Lag features (kwh_lag_1, kwh_lag_24) to capture short-term and daily patterns.
Rolling averages to smooth out fluctuations.
Weather fields (outside_temp, humidity) to model environmental impact.
This step is where raw data turns into usable intelligence.
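A condensed sketch of this merge-and-feature step is shown below, assuming both CSVs share a timestamp column; features.py itself may name or compute things differently.

```python
# Condensed sketch: merge IoT and weather data, then build lag, rolling,
# and calendar features for training.
import pandas as pd

iot = pd.read_csv("data/iot_simulated.csv", parse_dates=["timestamp"])
weather = pd.read_csv("data/weather.csv", parse_dates=["timestamp"])
df = iot.merge(weather, on="timestamp", how="inner").sort_values("timestamp")

# Lag features: usage 1 hour ago and at the same hour yesterday
df["kwh_lag_1"] = df["kwh"].shift(1)
df["kwh_lag_24"] = df["kwh"].shift(24)

# Rolling average smooths out hour-to-hour fluctuations
df["kwh_roll_24h"] = df["kwh"].rolling(window=24).mean()

# Calendar features capture the daily and weekly rhythm
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek

df.dropna().to_csv("data/train_dataset.csv", index=False)
```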
Step 3: Training the Model
Using train.py, I trained an XGBoost regression model on 60 days of data. The model learned to predict energy usage for each hour from the lag, rolling-average, and weather features built in Step 2.
After training, the model’s performance looked solid — MAE ≈ 0.07, RMSE ≈ 0.09, and MAPE around 10-15%. Pretty good for a simulated environment!
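Here is a condensed sketch of that training step, fitting an XGBoost regressor on the merged dataset and reporting MAE, RMSE, and MAPE on a held-out final week; train.py may differ in its exact split and parameters.

```python
# Condensed sketch: train an XGBoost regressor on the merged dataset and
# evaluate MAE / RMSE / MAPE on the last week of data.
from pathlib import Path

import numpy as np
import pandas as pd
from joblib import dump
from sklearn.metrics import mean_absolute_error, mean_squared_error
from xgboost import XGBRegressor

df = pd.read_csv("data/train_dataset.csv", parse_dates=["timestamp"])
features = ["kwh_lag_1", "kwh_lag_24", "kwh_roll_24h", "hour", "day_of_week",
            "outside_temp", "humidity"]

# Time-ordered split: the last 7 days are held out for evaluation
train, test = df.iloc[:-24 * 7], df.iloc[-24 * 7:]
model = XGBRegressor(n_estimators=400, max_depth=5, learning_rate=0.05)
model.fit(train[features], train["kwh"])

pred = model.predict(test[features])
mae = mean_absolute_error(test["kwh"], pred)
rmse = np.sqrt(mean_squared_error(test["kwh"], pred))
mape = np.mean(np.abs((test["kwh"] - pred) / test["kwh"])) * 100
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  MAPE={mape:.1f}%")

Path("models").mkdir(exist_ok=True)
dump(model, "models/xgb_kwh.joblib")   # picked up later by forecast_plotly.py
```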
Step 4: Forecasting and Visualization
Once the model was trained, I moved to the fun part: visualizing the predictions.
Using Plotly, I built forecast_plotly.py, which generates an interactive dashboard. It displays two parts: the recent history of actual usage and the predicted next 24 hours.
A gray vertical line separates “past” from “future”, making the forecast transition crystal clear. You can zoom in, hover over points to see values, and even export the chart as HTML.
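A condensed sketch of such a dashboard script is shown below; the stand-in "future" rows and the feature list are assumptions, since a real run would build the next 24 hours of inputs step by step.

```python
# Condensed sketch: plot past usage and a 24-hour forecast with Plotly,
# separated by a gray vertical line, and export as HTML.
import pandas as pd
import plotly.graph_objects as go
from joblib import load

df = pd.read_csv("data/train_dataset.csv", parse_dates=["timestamp"])
model = load("models/xgb_kwh.joblib")
features = ["kwh_lag_1", "kwh_lag_24", "kwh_roll_24h", "hour", "day_of_week",
            "outside_temp", "humidity"]

history = df.tail(24 * 7)                              # last week of actual readings
future = df.tail(24).copy()                            # stand-in rows for the next 24 h;
future["kwh_pred"] = model.predict(future[features])   # a real run builds them hour by hour

fig = go.Figure()
fig.add_trace(go.Scatter(x=history["timestamp"], y=history["kwh"],
                         mode="lines", name="Past usage (kWh)"))
fig.add_trace(go.Scatter(x=future["timestamp"], y=future["kwh_pred"],
                         mode="lines", line=dict(dash="dot"), name="Forecast (kWh)"))
fig.add_vline(x=history["timestamp"].iloc[-1], line_color="gray")  # past/future divider
fig.write_html("forecast_interactive.html")
```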
(Screenshots Figure_1, Figure_2, and Figure_3 show views of the interactive forecast dashboard.)
Project Structure
The project is organized cleanly to keep everything modular and easy to maintain.
D:\AI Models\energy-saver
│
├── data
│ ├── iot_simulated.csv # Simulated IoT energy readings
│ ├── weather.csv # Simulated weather data
│ ├── train_dataset.csv # Merged dataset used for training
│
├── models
│ └── xgb_kwh.joblib # Trained XGBoost model
│
├── src
│ ├── simulate_iot.py # IoT data generator
│ ├── simulate_weather.py # Weather data generator
│ ├── features.py # Merging and feature creation
│ ├── train.py # Model training and evaluation
│ ├── forecast_plotly.py # Interactive visualization (Plotly)
│
├── venv\ # Virtual environment
│
├── README.md # Project documentation
└── forecast_interactive.html # Saved interactive dashboard
The final result is a small yet complete prototype of a smart energy management system. With a few adjustments (real IoT data, a weather API, and live retraining), this same setup could power a real “AI-based home energy advisor.”
It doesn’t just predict — it can help decide when it’s cheaper or smarter to use energy, saving both cost and power.
Reflection
This project turned out to be an amazing hands-on way to combine data simulation, feature engineering, model training, and visualization in one workflow.
Every part of it has a clear role, as the project structure above shows.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains clinical and demographic patient data used to predict the most suitable drug prescription (multiclass classification task). It's designed for developing machine learning models that assist in personalized medicine.
Origin & Purpose:
- Source: synthetic/benchmark dataset (commonly used in ML courses)
- Goal: predict one of 5 drugs (A, B, C, X, or Y) based on patient metrics
- Size: 200 patient records

**Notable Characteristics:**
- Class imbalance: drugY = 39.5% (most frequent); drugA/B/C = 10-11% each
- Clinical relevance: blood pressure (BP) and cholesterol levels heavily influence drug choice; electrolytes (Na/K) show non-linear relationships with outcomes

Use Cases:
- Multiclass classification practice
- Feature importance analysis (e.g., "Does age or BP matter more?")
- Medical decision-support prototyping

Sample Insight: Patients with LOW BP and HIGH cholesterol are often prescribed drugC, while those with NORMAL vitals typically receive drugX or drugY.

Ideal For: Logistic Regression, Random Forests, Gradient Boosting (XGBoost/CatBoost), and neural networks for tabular data (see the sketch below).
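As mentioned above, a minimal multiclass sketch might look like this, assuming the classic column layout (Age, Sex, BP, Cholesterol, Na_to_K, Drug) and a placeholder filename; adjust both to the actual file.

```python
# Minimal sketch: multiclass drug prediction with XGBoost.
# "drug200.csv" and the column names are assumptions about the file layout.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

df = pd.read_csv("drug200.csv")                       # placeholder filename
X = pd.get_dummies(df.drop(columns=["Drug"]), columns=["Sex", "BP", "Cholesterol"])
le = LabelEncoder()
y = le.fit_transform(df["Drug"])                      # drugA..drugY -> integers

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)  # multiclass handled automatically
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), target_names=le.classes_))
```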
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed for practicing fake news detection using machine learning and natural language processing (NLP) techniques. It includes a rich collection of 20,000 news articles, carefully generated to simulate real-world data scenarios. Each record contains metadata about the article and a label indicating whether the news is real or fake.
The dataset also intentionally includes around 5% missing values in some fields to simulate the challenges of handling incomplete data in real-life projects.
| Column | Description |
|---|---|
| title | A short headline summarizing the article (around 6 words) |
| text | The body of the news article (200–300 words on average) |
| date | The publication date of the article, randomly selected over the past 3 years |
| source | The media source that published the article (e.g., BBC, CNN, Al Jazeera); may contain missing values (~5%) |
| author | The author's full name; some entries are missing (~5%) to simulate real-world incomplete data |
| category | The general category of the article (e.g., Politics, Health, Sports, Technology) |
| label | The target label: real or fake news |
Fake News Detection Practice: Perfect for binary classification tasks.
NLP Preprocessing: Allows users to practice text cleaning, tokenization, vectorization, etc.
Handling Missing Data: Some fields are incomplete to simulate real-world data challenges.
Feature Engineering: Encourages creating new features from text and metadata.
Balanced Labels: Realistic distribution of real and fake news for fair model training.
Building and evaluating text classification models (e.g., Logistic Regression, Random Forests, XGBoost); a minimal sketch follows this list.
Practicing NLP techniques like TF-IDF, Word2Vec, BERT embeddings.
Performing exploratory data analysis (EDA) on news data.
Developing pipelines for dealing with missing values and feature extraction.
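As referenced above, here is a minimal sketch of the first use case (TF-IDF features plus an XGBoost classifier); the exact label strings ("real"/"fake") are assumed from the column description.

```python
# Minimal sketch: TF-IDF + XGBoost fake-news classifier on the documented columns.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("fake_news_dataset.csv")

# Combine headline and body; the documented missing values are in source/author,
# but fillna keeps the sketch robust anyway.
text = df["title"].fillna("") + " " + df["text"].fillna("")
y = (df["label"] == "fake").astype(int)          # assumes lowercase label strings

X_tr, X_te, y_tr, y_te = train_test_split(text, y, test_size=0.2,
                                          stratify=y, random_state=42)
vectorizer = TfidfVectorizer(max_features=20000, stop_words="english", ngram_range=(1, 2))
X_tr_vec = vectorizer.fit_transform(X_tr)
X_te_vec = vectorizer.transform(X_te)

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1, eval_metric="logloss")
clf.fit(X_tr_vec, y_tr)
print("Accuracy:", round(accuracy_score(y_te, clf.predict(X_te_vec)), 3))
```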
This dataset has been synthetically generated to closely resemble real news articles. The diversity in titles, text, sources, and categories ensures that models trained on this dataset can generalize well to unseen, real-world data. However, since it is synthetic, it should not be used for production models or decision-making without careful validation.
Filename: fake_news_dataset.csv
Size: 20,000 rows × 7 columns
Missing Data: ~5% missing values in the source and author columns.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
With the environmental protection requirements brought about by the large-scale application of polymers in industrial fields, understanding the viscosities of polymers is becoming increasingly important. The different arrangements and crystallinity of polymers make their viscosities difficult to calculate. To address this challenge, new strategies based on artificial intelligence algorithms are proposed. First, the strategy trains three artificial intelligence algorithms [extreme gradient boosting (XGBoost), convolutional neural network (CNN), and multilayer perceptron (MLP)] on molecular descriptors of the polymer molecular properties. Next, the PC-SAFT parameters are input into the XGBoost and CNN algorithms as molecular descriptors representing the thermodynamic properties of the polymer to improve the accuracy of the predictions. Subsequently, Molecular ACCess System (MACCS) chemical fingerprints were combined with the XGBoost and CNN algorithms to further improve the accuracy of viscosity prediction. The XGBoost algorithm was identified as the best algorithm for predicting the viscosities of polymers in different states. This discovery is expected to provide useful information for screening polymers for applications in medicine and the chemical industry.
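A minimal sketch of the fingerprint-plus-XGBoost idea is given below, using RDKit's MACCS keys as inputs to an XGBoost regressor; the SMILES strings and viscosity values are placeholders, and the PC-SAFT descriptors are omitted.

```python
# Minimal sketch: MACCS fingerprints (RDKit) as features for an XGBoost
# viscosity regressor, on placeholder molecules and values.
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from xgboost import XGBRegressor

smiles = ["CCO", "CCCCO", "CCOC(=O)C", "c1ccccc1O", "CC(C)O", "CCCCCCO"]
viscosities = [1.2, 2.6, 0.45, 3.4, 2.0, 4.6]        # placeholder values (mPa*s)

# 167-bit MACCS fingerprint for each structure
X = np.array([list(MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s))) for s in smiles])
y = np.array(viscosities)

model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X, y)

query = np.array([list(MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles("CCCO")))])
print("Predicted viscosity for CCCO:", model.predict(query))
```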
📖 Description: This dataset was designed to explore how psychological traits, attitudinal measures, and motivational drivers influence sustainable luxury purchase intention. It combines quantitative responses from 500 consumers on multiple validated scales, including sustainability attitudes, purchase intention, personality (Big Five), and motivational factors.
The dataset was collected as part of an academic-industry research project on sustainable luxury consumption and consumer psychology. It aims to bridge the gap between marketing theory and predictive analytics by providing a structured, research-grade dataset suitable for both statistical and machine learning modeling.
🎯 Business & Research Use Cases
- Predict purchase intention for sustainable luxury products.
- Segment consumers based on eco-conscious attitudes and personality traits.
- Build marketing analytics models that link sustainability values with buying behavior.
- Use for teaching and demonstration in data-driven marketing, consumer analytics, or ethical branding.