31 datasets found
  1. Hyperparameters for the XGBoost model.

    • plos.figshare.com
    xls
    Updated Nov 27, 2024
    Cite
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen (2024). Hyperparameters for the XGBoost model. [Dataset]. http://doi.org/10.1371/journal.pone.0312531.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hoa Thi Trinh; Tuan Anh Pham; Vu Dinh Tho; Duy Hung Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In building structures, lateral load-bearing capacity depends mainly on reinforced concrete (RC) walls, so determining their flexural and shear strength is a mandatory design step. These strengths are typically obtained from theoretical formulas and verified experimentally, but the formulas often carry large errors while testing is costly and time-consuming. This study therefore applies machine learning, specifically a hybrid XGBoost model combined with optimization algorithms, to predict the shear strength of RC walls from available experimental results. The study used the largest database of RC walls to date: 1,057 samples with various cross-sectional shapes. Bayesian optimization (BO) methods, namely BO-Gaussian Process and BO-Random Forest, together with Random Search, were used to refine the XGBoost model architecture. Gaussian Process emerged as the most efficient option, yielding the lowest mean square error and prediction R² values of 0.998 on the training set, 0.972 on the validation set, and 0.984 on the test set. BO-Random Forest and Random Search matched Gaussian Process on the training and test sets but performed significantly worse on the validation set, with validation R² of 0.970 and 0.969 respectively, over the entire dataset covering all cross-sectional shapes. The SHAP (SHapley Additive exPlanations) technique was used to clarify the model's predictive behavior and the importance of the input variables, and performance was further validated against benchmark models and current design standards. Notably, the coefficient of variation (COV) of the XGBoost model is 13.27%, whereas traditional models often exceed 50%.
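
    The tuning code itself is not part of this record; as a rough sketch of the BO-Gaussian Process approach the abstract describes, XGBoost hyperparameters can be tuned with a GP-backed search such as scikit-optimize's BayesSearchCV (the parameter ranges, data shapes, and package choice below are illustrative assumptions, not the authors' published setup):

    # Hedged sketch: Gaussian-process Bayesian optimization of XGBoost
    # hyperparameters, minimizing cross-validated MSE as in the study.
    # Data, ranges, and iteration counts are placeholders.
    import numpy as np
    from skopt import BayesSearchCV          # GP surrogate by default
    from skopt.space import Integer, Real
    from xgboost import XGBRegressor

    X = np.random.rand(200, 8)               # stand-in for wall features
    y = np.random.rand(200)                  # stand-in for shear strengths

    search = BayesSearchCV(
        estimator=XGBRegressor(objective="reg:squarederror"),
        search_spaces={
            "n_estimators": Integer(100, 1000),
            "max_depth": Integer(3, 10),
            "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
            "subsample": Real(0.5, 1.0),
        },
        n_iter=40,                            # BO evaluations
        cv=5,
        scoring="neg_mean_squared_error",     # BO minimizes MSE, as above
        random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_, -search.best_score_)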

  2. Model hyperparameters used for the XGBoost model.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 17, 2024
    Cite
    Saikia, Bhaskar Jyoti; Scaria, Vinod; Kumar, Mukesh; K. , Binukumar B.; Vatsyayan, Aastha (2024). Model hyperparameters used for the XGBoost model. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001281231
    Explore at:
    Dataset updated
    May 17, 2024
    Authors
    Saikia, Bhaskar Jyoti; Scaria, Vinod; Kumar, Mukesh; K. , Binukumar B.; Vatsyayan, Aastha
    Description

    Background: Advances in Next Generation Sequencing have made rapid variant discovery and detection widely accessible. To facilitate a better understanding of the nature of these variants, the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG-AMP) have issued a set of guidelines for variant classification. However, given the vast number of variants associated with any disorder, it is impossible to manually apply these guidelines to all known variants. Machine learning offers a rapid way to classify large numbers of variants, including variants of uncertain significance, as either pathogenic or benign. Here we classify ATP7B genetic variants by employing ML and AI algorithms trained on our well-annotated WilsonGen dataset.

    Methods: We trained and validated two algorithms, TabNet and XGBoost, on a high-confidence dataset of manually annotated, ACMG & AMP classified variants of the ATP7B gene associated with Wilson's Disease.

    Results: Using an independent validation dataset of ACMG & AMP classified variants, as well as a patient set of functionally validated variants, we show how both algorithms perform and can be used to classify large numbers of variants in clinical and research settings.

    Conclusion: We have created a ready-to-deploy tool that can classify variants linked with Wilson's disease as pathogenic or benign, and that can be utilized by both clinicians and researchers to better understand the disease through the nature of its associated genetic variants.

  3. CMAB-The World's First National-Scale Multi-Attribute Building Dataset

    • figshare.com
    bin
    Updated Apr 20, 2025
    Cite
    Yecheng Zhang; Huimin Zhao; Ying Long (2025). CMAB-The World's First National-Scale Multi-Attribute Building Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.27992417.v7
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 20, 2025
    Dataset provided by
    figshare
    Authors
    Yecheng Zhang; Huimin Zhao; Ying Long
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Rapidly acquiring three-dimensional (3D) building data, including geometric attributes such as rooftop, height, and orientation, as well as indicative attributes such as function, quality, and age, is essential for accurate urban analysis, simulation, and policy updates. Current building datasets suffer from incomplete coverage of building multi-attributes. This paper presents the first national-scale Multi-Attribute Building dataset (CMAB), built with artificial intelligence and covering 3,667 spatial cities, 31 million buildings, and 23.6 billion m² of rooftops (extracted with OCRNet at an F1-score of 89.93%), totaling 363 billion m³ of building stock. We trained bootstrap-aggregated XGBoost models with city administrative classifications, incorporating morphology, location, and function features. Using multi-source data, including billions of remote sensing images and 60 million street view images (SVIs), we generated rooftop, height, structure, function, style, age, and quality attributes for each building with machine learning and large multimodal models. Accuracy was validated through model benchmarks, comparison with existing similar products, and manual SVI validation, and is mostly above 80%. Our dataset and results are crucial for global SDGs and urban planning.

    Data records: A building dataset with a total rooftop area of 23.6 billion square meters across 3,667 natural cities in China, including each building's rooftop, height, structure, function, age, style, and quality attributes, together with the code files used to compute these data. The deep learning models used are OCRNet, XGBoost, fine-tuned CLIP, and YOLOv8.

    Supplementary note: The architectural structure, style, and quality attributes are affected by the temporal and spatial distribution of street views in China. For building colors, we found that existing CLIP-series models cannot accurately judge color composition and proportion, so colors are instead calculated via semantic segmentation and image processing. Please contact zhangyec23@mails.tsinghua.edu.cn or ylong@tsinghua.edu.cn with any technical problems.

    Reference format: Zhang, Y., Zhao, H. & Long, Y. CMAB: A Multi-Attribute Building Dataset of China. Sci Data 12, 430 (2025). https://doi.org/10.1038/s41597-025-04730-5.
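
    The released code is not reproduced in this listing; below is a minimal sketch of the "bootstrap aggregated XGBoost" idea, using scikit-learn's BaggingClassifier around an XGBoost base learner (the feature layout and class count are assumptions):

    # Hedged sketch: bagging XGBoost classifiers over bootstrap resamples.
    # Requires scikit-learn >= 1.2 for the `estimator` keyword.
    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from xgboost import XGBClassifier

    X = np.random.rand(1000, 12)             # stand-in morphology/location features
    y = np.random.randint(0, 4, 1000)        # stand-in building-function classes

    bagged = BaggingClassifier(
        estimator=XGBClassifier(n_estimators=200, max_depth=6),
        n_estimators=10,                     # ten bootstrap replicates
        bootstrap=True,
        n_jobs=-1,
        random_state=0,
    )
    bagged.fit(X, y)
    print(bagged.predict(X[:5]))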

  4. Data from: Extreme Gradient Boosting as a Method for Quantitative Structure–Activity Relationships

    • acs.figshare.com
    • figshare.com
    zip
    Updated Jun 6, 2023
    Cite
    Robert P. Sheridan; Wei Min Wang; Andy Liaw; Junshui Ma; Eric M. Gifford (2023). Extreme Gradient Boosting as a Method for Quantitative Structure–Activity Relationships [Dataset]. http://doi.org/10.1021/acs.jcim.6b00591.s033
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan; Wei Min Wang; Andy Liaw; Junshui Ma; Eric M. Gifford
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In the pharmaceutical industry it is common to generate many QSAR models from training sets containing large numbers of molecules and descriptors. The best QSAR methods are those that generate the most accurate predictions without being overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions that are, on average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed: whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
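
    The paper's actual standard parameter values are not included in this record; the sketch below only illustrates the pattern it advocates, i.e. reusing one fixed XGBoost configuration across many QSAR data sets (all values are placeholders):

    # Hedged sketch: one shared ("standard") parameter set applied to every
    # QSAR training set; the values are illustrative, not the paper's.
    from xgboost import XGBRegressor

    STANDARD_PARAMS = dict(
        n_estimators=1000,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        n_jobs=1,                            # single-CPU, as in the speed claim
    )

    def fit_qsar_model(X_train, y_train):
        """Train one QSAR model with the shared parameter set."""
        model = XGBRegressor(**STANDARD_PARAMS)
        model.fit(X_train, y_train)
        return model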

  5. Electricity Demand Historical Data

    • kaggle.com
    zip
    Updated Jul 26, 2025
    Cite
    Data Science Lovers (2025). Electricity Demand Historical Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/electricity-demand-data-dsl
    Explore at:
    Available download formats: zip (968020 bytes)
    Dataset updated
    Jul 26, 2025
    Authors
    Data Science Lovers
    License

    Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    📹 Project Video available on YouTube - https://youtu.be/iop8TUxmgO0

    🖇️Connect with me on LinkedIn - https://www.linkedin.com/in/rohit-grewal

    Electricity Demand Forecasting Dataset (XGBoost Model Ready)

    This dataset contains five years of historical data for predicting electricity demand with machine learning, especially with models like XGBoost. It includes features such as temperature, humidity, wind speed, and past electricity usage across different time intervals.

    The dataset is designed to help you learn and build models that can forecast how much electricity people might use in the future. This is useful for energy companies, smart grids, and power management systems.

    The features/columns available in the dataset are:

    • Timestamp: The date of the observation
    • Demand: Actual electricity demand at that time (target variable)
    • Temperature: Temperature in degrees Celsius
    • Humidity: Humidity percentage
    • Hour: Hour of the day (0–23)
    • DayOfWeek: Day of the week (0 = Monday, 6 = Sunday)
    • Month: Month number (1 = January, 12 = December)
    • Year: Year of the observation

    Potential Use Cases:

    • Build regression models to forecast electricity demand
    • Use lag and rolling features in time series models
    • Compare performance of ML algorithms like XGBoost, Random Forest, and LSTM
    • Learn how environmental and time-based factors affect electricity usage
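
    As a starting point for the first two use cases, a hedged sketch of lag/rolling feature construction and an XGBoost regressor (the file name, hourly row spacing, and hold-out size are assumptions):

    # Hedged sketch: lag/rolling features + XGBoost on the columns above.
    import pandas as pd
    from xgboost import XGBRegressor

    df = pd.read_csv("electricity_demand.csv", parse_dates=["Timestamp"])
    df = df.sort_values("Timestamp")

    # Assuming hourly rows: 24 steps = one day
    df["Demand_lag_24"] = df["Demand"].shift(24)
    df["Demand_roll_7d"] = df["Demand"].rolling(24 * 7).mean()
    df = df.dropna()

    features = ["Temperature", "Humidity", "Hour", "DayOfWeek", "Month",
                "Demand_lag_24", "Demand_roll_7d"]
    train, test = df.iloc[:-24 * 30], df.iloc[-24 * 30:]   # hold out ~30 days

    model = XGBRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(train[features], train["Demand"])
    pred = model.predict(test[features])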

  6. Table_1_A cost-effective, machine learning-driven approach for screening arterial functional aging in a large-scale Chinese population.DOC

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    doc
    Updated Mar 20, 2024
    Cite
    Rujia Miao; Qian Dong; Xuelian Liu; Yingying Chen; Jiangang Wang; Jianwen Chen (2024). Table_1_A cost-effective, machine learning-driven approach for screening arterial functional aging in a large-scale Chinese population.DOC [Dataset]. http://doi.org/10.3389/fpubh.2024.1365479.s002
    Explore at:
    Available download formats: doc
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Rujia Miao; Qian Dong; Xuelian Liu; Yingying Chen; Jiangang Wang; Jianwen Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: An easily accessible, cost-free machine learning model based on prior probabilities of vascular aging enables an application to pinpoint high-risk populations before physical checks and to optimize healthcare investment.

    Methods: A dataset containing questionnaire responses and physical measurement parameters from 77,134 adults was extracted from the electronic records of the Health Management Center at the Third Xiangya Hospital. The least absolute shrinkage and selection operator (LASSO) and recursive feature elimination with the light gradient boosting machine (LightGBM) were employed to select features from a pool of potential covariates. Participants were randomly divided into training (70%) and test (30%) cohorts. Four machine learning algorithms were applied to build screening models for elevated arterial stiffness (EAS), and model performance was evaluated by the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy.

    Results: Fourteen easily accessible features were selected to construct the model: "systolic blood pressure" (SBP), "age," "waist circumference," "history of hypertension," "sex," "exercise," "awareness of normal blood pressure," "eat fruit," "work intensity," "drink milk," "eat bean products," "smoking," "alcohol consumption," and "irritableness." The extreme gradient boosting (XGBoost) model outperformed the other three models, achieving AUC values of 0.8722 and 0.8710 in the training and test sets, respectively. The five most important features are SBP, age, waist circumference, history of hypertension, and sex.

    Conclusion: The XGBoost model effectively assesses the prior probability of current EAS in the general population. Integrating the model into primary care facilities has the potential to lower medical expenses and enhance the management of arterial aging.
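
    The study's code is not part of this record; below is a minimal sketch of the evaluation it describes (AUC, sensitivity, specificity, accuracy for a binary EAS classifier), on placeholder data with the same 70/30 split:

    # Hedged sketch: screening-model metrics named above, placeholder data.
    import numpy as np
    from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X = np.random.rand(500, 14)              # 14 features, as in the abstract
    y = np.random.randint(0, 2, 500)         # 1 = elevated arterial stiffness
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    model = XGBClassifier().fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    pred = (proba >= 0.5).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    print("AUC:", roc_auc_score(y_te, proba))
    print("Sensitivity:", tp / (tp + fn))
    print("Specificity:", tn / (tn + fp))
    print("Accuracy:", accuracy_score(y_te, pred))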

  7. A machine learning based prediction model for life expectancy

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Nov 14, 2022
    Cite
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo (2022). A machine learning based prediction model for life expectancy [Dataset]. http://doi.org/10.5061/dryad.z612jm6fv
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 14, 2022
    Dataset provided by
    Dryad
    Authors
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo
    Time period covered
    Oct 12, 2022
    Description

    Microsoft Excel

  8. Fraud Detection Transactions Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Cite
    Samay Ashar (2025). Fraud Detection Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/fraud-detection-transactions-dataset
    Explore at:
    Available download formats: zip (2104444 bytes)
    Dataset updated
    Feb 21, 2025
    Authors
    Samay Ashar
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed to help data scientists and machine learning enthusiasts develop robust fraud detection models. It contains realistic synthetic transaction data, including user information, transaction types, risk scores, and more, making it ideal for binary classification tasks with models like XGBoost and LightGBM.

    📌 Key Features

    1. 21 features capturing various aspects of a financial transaction
    2. Realistic structure with numerical, categorical, and temporal data
    3. Binary fraud labels (0 = Not Fraud, 1 = Fraud)
    4. Designed for high accuracy with XGBoost and other ML models
    5. Useful for anomaly detection, risk analysis, and security research

    📌 Columns in the Dataset

    • Transaction_ID: Unique identifier for each transaction
    • User_ID: Unique identifier for the user
    • Transaction_Amount: Amount of money involved in the transaction
    • Transaction_Type: Type of transaction (Online, In-Store, ATM, etc.)
    • Timestamp: Date and time of the transaction
    • Account_Balance: User's account balance before the transaction
    • Device_Type: Type of device used (Mobile, Desktop, etc.)
    • Location: Geographical location of the transaction
    • Merchant_Category: Type of merchant (Retail, Food, Travel, etc.)
    • IP_Address_Flag: Whether the IP address was flagged as suspicious (0 or 1)
    • Previous_Fraudulent_Activity: Number of past fraudulent activities by the user
    • Daily_Transaction_Count: Number of transactions made by the user that day
    • Avg_Transaction_Amount_7d: User's average transaction amount in the past 7 days
    • Failed_Transaction_Count_7d: Count of failed transactions in the past 7 days
    • Card_Type: Type of payment card used (Credit, Debit, Prepaid, etc.)
    • Card_Age: Age of the card in months
    • Transaction_Distance: Distance between the user's usual location and the transaction location
    • Authentication_Method: How the user authenticated (PIN, Biometric, etc.)
    • Risk_Score: Fraud risk score computed for the transaction
    • Is_Weekend: Whether the transaction occurred on a weekend (0 or 1)
    • Fraud_Label: Target variable (0 = Not Fraud, 1 = Fraud)

    📌 Potential Use Cases

    1. Fraud detection model training
    2. Anomaly detection in financial transactions
    3. Risk scoring systems for banks and fintech companies
    4. Feature engineering and model explainability research
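
    A hedged baseline for the first use case, with the CSV file name assumed and scale_pos_weight used to counteract the fraud/non-fraud imbalance:

    # Hedged sketch: baseline fraud classifier on the columns above.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    df = pd.read_csv("fraud_transactions.csv")           # file name assumed
    y = df["Fraud_Label"]
    X = pd.get_dummies(df.drop(columns=["Fraud_Label", "Transaction_ID",
                                        "User_ID", "Timestamp"]))

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    model = XGBClassifier(
        n_estimators=400,
        scale_pos_weight=(y_tr == 0).sum() / max((y_tr == 1).sum(), 1),
    )
    model.fit(X_tr, y_tr)
    print("Test accuracy:", model.score(X_te, y_te))
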
  9. Surrogate models used in the study.

    • plos.figshare.com
    xls
    Updated Oct 25, 2024
    Cite
    Sanja Stevanović; Husain Dashti; Marko Milošević; Salem Al-Yakoob; Dragan Stevanović (2024). Surrogate models used in the study. [Dataset]. http://doi.org/10.1371/journal.pone.0312573.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 25, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sanja Stevanović; Husain Dashti; Marko Milošević; Salem Al-Yakoob; Dragan Stevanović
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Surrogate optimisation holds great promise for building energy optimisation studies because it aims to replace lengthy building energy simulations within an optimisation step with expendable local surrogate models that can quickly predict simulation results. To be useful for this purpose, it should be possible to train precise surrogate models quickly from a small number of simulation results (10-100) obtained from appropriately sampled points in the desired part of the design space. Two sampling methods and two machine learning models are compared here. Latin hypercube sampling (LHS), widely accepted in the building energy community, is compared to an exploratory Monte Carlo-based sequential design method, mc-intersite-proj-th (MIPT). Artificial neural networks (ANN), also widely accepted in the building energy community, are compared to gradient-boosted tree ensembles (XGBoost), the model of choice in many machine learning competitions. To better understand the behaviour of these two sampling methods and two machine learning models, we compare their predictions against a large set of generated synthetic data. For this purpose, a simple case study of an office cell model with a single window and a fixed overhang, whose main input parameters are overhang depth and height, with climate type, presence of obstacles, orientation, and heating and cooling set points as additional input parameters, was extensively simulated with EnergyPlus to form a large underlying dataset of 729,000 simulation results. Expendable local surrogate models for predicting the simulated heating, cooling, and lighting loads and equivalent primary energy needs of the office cell were trained using both LHS and MIPT and both ANN and XGBoost for several main hyperparameter choices. Results show that XGBoost models are more precise than ANN models and that, for both machine learning models, MIPT sampling leads to more precise surrogates than LHS.
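
    Neither the simulator nor the training code is included here; below is a hedged sketch of the LHS/XGBoost arm of the comparison, with a stand-in function in place of an EnergyPlus run and illustrative bounds for overhang depth and height:

    # Hedged sketch: Latin hypercube design + XGBoost surrogate.
    import numpy as np
    from scipy.stats import qmc              # SciPy >= 1.7
    from xgboost import XGBRegressor

    def run_simulation(depth, height):
        """Stand-in for an EnergyPlus run returning a load value."""
        return 50 - 8 * depth + 3 * height + np.random.normal(0, 0.1)

    sampler = qmc.LatinHypercube(d=2, seed=0)
    unit = sampler.random(n=60)                          # 10-100 points, as above
    pts = qmc.scale(unit, [0.0, 0.0], [2.0, 1.5])        # depth, height bounds (m)

    y = np.array([run_simulation(d, h) for d, h in pts])
    surrogate = XGBRegressor(n_estimators=300, max_depth=4).fit(pts, y)
    print(surrogate.predict(pts[:3]))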

  10. The pseudo-code for the XGBoost-Shap.

    • plos.figshare.com
    xls
    Updated Sep 5, 2024
    Cite
    Sixuan Wang; Bin Luo (2024). The pseudo-code for the XGBoost-Shap. [Dataset]. http://doi.org/10.1371/journal.pone.0309838.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 5, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sixuan Wang; Bin Luo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Student academic achievement is an important indicator for evaluating educational quality; in particular, achievement prediction empowers educators to tailor their instructional approaches, fostering advances in both student performance and overall educational quality. However, extracting valuable insights from vast educational data to develop effective strategies for evaluating student performance remains a significant challenge for higher education institutions. Traditional machine learning (ML) algorithms often struggle to clearly delineate the interplay between the factors that influence academic success and the resulting grades. To address these challenges, this paper introduces the XGB-SHAP model, a novel approach for predicting student achievement that combines Extreme Gradient Boosting (XGBoost) with SHapley Additive exPlanations (SHAP). The model was applied to a dataset from a public university in Wuhan, comprising the academic records of 87 students enrolled in a Japanese course between September 2021 and June 2023. The findings indicate that the model excels in accuracy, achieving a mean absolute error (MAE) of approximately 6 and an R-squared value near 0.82, surpassing three other ML models. The model further uncovers how different instructional modes influence the factors that contribute to student achievement, supporting the need for feature selection customized to the characteristics of each teaching mode. Furthermore, the model highlights the importance of incorporating self-directed learning skills into student-related indicators when predicting academic performance.
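
    The XGB-SHAP pattern is straightforward to reproduce in outline; a hedged sketch on placeholder data (the real study uses 87 student records and its own indicators):

    # Hedged sketch: XGBoost + SHAP (TreeExplainer) as described above.
    import numpy as np
    import shap
    from xgboost import XGBRegressor

    X = np.random.rand(87, 10)               # 87 students, 10 stand-in indicators
    y = X @ np.random.rand(10) * 100         # stand-in course grades

    model = XGBRegressor(n_estimators=200).fit(X, y)
    explainer = shap.TreeExplainer(model)    # efficient for tree ensembles
    shap_values = explainer.shap_values(X)
    shap.summary_plot(shap_values, X)        # global feature-importance view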

  11. p-value of the independent t-test comparing the performance of XGBTree with other models using a 95% confidence interval

    • plos.figshare.com
    xls
    Updated Jun 15, 2023
    Cite
    Tuan Tran; Uyen Le; Yihui Shi (2023). p-value of the independent t-test comparing the performance of XGBTree with other models using a 95% confidence interval (Note: (*) implies the p-value is much smaller than 0.001). [Dataset]. http://doi.org/10.1371/journal.pone.0269135.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Tuan Tran; Uyen Le; Yihui Shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    p-value of the independent t-test comparing the performance of XGBTree with other models using a 95% confidence interval (Note: (*) implies the p-value is much smaller than 0.001).
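
    For readers reproducing such a table, an independent two-sample t-test over per-fold scores is a one-liner with SciPy (the score vectors below are illustrative, not the study's):

    # Hedged sketch: independent t-test comparing two models' CV scores.
    from scipy.stats import ttest_ind

    xgbtree_scores = [0.91, 0.93, 0.92, 0.94, 0.92]      # illustrative folds
    other_scores = [0.88, 0.89, 0.87, 0.90, 0.88]

    t_stat, p_value = ttest_ind(xgbtree_scores, other_scores)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")        # p < 0.05: significant at 95%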

  12. Properties of the office cell building model in various climates.

    • plos.figshare.com
    xls
    Updated Oct 25, 2024
    Cite
    Sanja Stevanović; Husain Dashti; Marko Milošević; Salem Al-Yakoob; Dragan Stevanović (2024). Properties of the office cell building model in various climates. [Dataset]. http://doi.org/10.1371/journal.pone.0312573.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 25, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sanja Stevanović; Husain Dashti; Marko Milošević; Salem Al-Yakoob; Dragan Stevanović
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Properties of the office cell building model in various climates.

  13. Data_Sheet_1_The effect of reading engagement on scientific literacy – an analysis based on the XGBoost method.docx

    • frontiersin.figshare.com
    docx
    Updated Feb 14, 2024
    Cite
    Canxi Cao; Tongxin Zhang; Tao Xin (2024). Data_Sheet_1_The effect of reading engagement on scientific literacy – an analysis based on the XGBoost method.docx [Dataset]. http://doi.org/10.3389/fpsyg.2024.1329724.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Frontiers
    Authors
    Canxi Cao; Tongxin Zhang; Tao Xin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scientific literacy is a key factor in personal competitiveness, and reading is the most common activity in daily learning; leveraging reading's day-to-day influence on individuals is therefore one of the most accessible ways to raise scientific literacy across the population. Reading engagement is an important student characteristic related to reading literacy. It is highly malleable and is jointly reflected in behavioral, cognitive, and affective engagement, so exploring the relationship between reading engagement and scientific literacy is of both theoretical and practical significance. In this study, we used PISA 2018 data from China to explore this relationship with a sample of 15-year-old students in mainland China. Thirty-six variables related to reading engagement, along with background variables (gender, grade, and family socioeconomic and cultural status), were selected from the questionnaire as independent variables, the score on the Scientific Literacy Assessment (SLA) was taken as the outcome variable, and a supervised machine learning method, the XGBoost algorithm, was used to construct the model. The dataset was randomly divided into training and test sets to optimize the model, verifying that the resulting model has good fit and generalization ability. Global and local personalized interpretation was provided by introducing SHAP values, a cutting-edge model-interpretation method. Among the three major components of reading engagement, cognitive engagement proved the more influential factor: students with high cognitive engagement in reading are more likely to score well on the scientific literacy assessment, and this component is relatively dominant in the model. The study also verifies the feasibility of a currently popular machine learning model, XGBoost, in a large-scale international education assessment program, with good model adaptability and support for global and local interpretation.

  14. Optimal hyperparameters of the models.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Tuan Tran; Uyen Le; Yihui Shi (2023). Optimal hyperparameters of the models. [Dataset]. http://doi.org/10.1371/journal.pone.0269135.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Tuan Tran; Uyen Le; Yihui Shi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Optimal hyperparameters of the models.

  15. The city crash data feature and variable.

    • plos.figshare.com
    xls
    Updated May 6, 2024
    Cite
    Abdulaziz H. Alshehri; Fayez Alanazi; Ahmed. M. Yosri; Muhammad Yasir (2024). The city crash data feature and variable. [Dataset]. http://doi.org/10.1371/journal.pone.0302171.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    May 6, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Abdulaziz H. Alshehri; Fayez Alanazi; Ahmed. M. Yosri; Muhammad Yasir
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study uses machine learning methods to examine the causative factors of significant crashes, focusing on accident type and driver age. A wide-ranging dataset from Jeddah city is employed to look into factors such as the driver's sex, the vehicle's location, and the prevailing weather conditions, and to compare the efficiency of four machine learning algorithms: XGBoost, CatBoost, LightGBM, and Random Forest. The results show that the XGBoost model (95.4% accuracy), the CatBoost model (94% accuracy), and the LightGBM model (94.9% accuracy) were all superior to the Random Forest model (89.1% accuracy); of the four, XGBoost achieved the highest accuracy. These differences between models illustrate the need for careful comparative analysis when assessing vehicle accidents. Machine learning is a transformative tool for traffic safety analysis, providing vital guidance for developing accurate traffic safety regulations.
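
    The study's code is not part of this record; a hedged sketch of the four-way comparison on placeholder data (each library is a separate package):

    # Hedged sketch: accuracy comparison of the four models named above.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from catboost import CatBoostClassifier
    from lightgbm import LGBMClassifier
    from xgboost import XGBClassifier

    X = np.random.rand(2000, 10)             # stand-in crash features
    y = np.random.randint(0, 2, 2000)        # stand-in severity label
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    models = {
        "XGBoost": XGBClassifier(),
        "CatBoost": CatBoostClassifier(verbose=0),
        "LightGBM": LGBMClassifier(),
        "RandomForest": RandomForestClassifier(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, model.score(X_te, y_te))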

  16. How I Built a Smart Energy Forecast Using XGBoost

    • kaggle.com
    zip
    Updated Oct 26, 2025
    Cite
    Andrii Siryi (2025). How I Built a Smart Energy Forecast Using XGBoost [Dataset]. https://www.kaggle.com/datasets/asiryi/how-i-built-a-smart-energy-forecast-using-xgboost
    Explore at:
    Available download formats: zip (771443 bytes)
    Dataset updated
    Oct 26, 2025
    Authors
    Andrii Siryi
    Description

    Over the last few days, I’ve been experimenting with an idea I’ve wanted to build for a long time — a small, intelligent energy-forecasting system that learns from IoT data and predicts electricity consumption in real time.

    The goal was simple: teach a model to understand how a household’s energy use changes with time, activity, and weather — and then visualize what the next 24 hours might look like.

    Here’s what the journey looked like:

    Step 1 Simulating Real IoT and Weather Data

    To start, I created realistic datasets for both IoT sensors and weather conditions.

    simulate_iot.py generated hourly energy readings (kWh) based on typical daily patterns — more usage in the evenings, less at night.

    simulate_weather.py produced temperature, humidity, and precipitation data for the same 60-day period.

    These two datasets became the foundation of the system — one describing human activity, the other representing environmental influence.

    Step 2 Feature Engineering

    The next piece was features.py, which merged both datasets into a single training set. Here the goal was to create features that the model could actually learn from:

    Lag features (kwh_lag_1, kwh_lag_24) to capture short-term and daily patterns.

    Rolling averages to smooth out fluctuations.

    Weather fields (outside_temp, humidity) to model environmental impact.

    This step is where raw data turns into usable intelligence.

    Step 3 Training the Model

    Using train.py, I trained an XGBoost regression model on 60 days of data. The model learned to predict energy usage for each hour based on:

    • time of day,
    • day of week,
    • weather conditions,
    • number of occupants,
    • HVAC activity,
    • and the previous energy history.

    After training, the model’s performance looked solid — MAE ≈ 0.07, RMSE ≈ 0.09, and MAPE around 10-15%. Pretty good for a simulated environment!
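
    For reference, the three metrics quoted above can be computed as follows (a hedged sketch; the arrays are placeholders):

    # Hedged sketch: MAE, RMSE, and MAPE for hourly kWh predictions.
    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    y_true = np.array([0.52, 0.61, 0.48, 0.75])          # actual kWh
    y_pred = np.array([0.50, 0.66, 0.45, 0.80])          # model output

    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

    print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  MAPE={mape:.1f}%")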

    Step 4 Forecasting and Visualization

    Once the model was trained, I moved to the fun part: visualizing the predictions.

    Using Plotly, I built forecast_plotly.py, which generates an interactive dashboard. It displays two parts:

    • The real energy consumption for the last 48 hours (blue line).
    • The forecasted energy usage for the next 24 hours (orange dashed line).

    A gray vertical line separates “past” from “future”, making the forecast transition crystal clear. You can zoom in, hover over points to see values, and even export the chart as HTML.

    [Screenshots: Figure_1.jpg, Figure_2.jpg, Figure_3.jpg (interactive forecast dashboard views)]

    Project Structure

    The project is organized cleanly to keep everything modular and easy to maintain:

    D:\AI Models\energy-saver
    ├── data
    │   ├── iot_simulated.csv          # Simulated IoT energy readings
    │   ├── weather.csv                # Simulated weather data
    │   └── train_dataset.csv          # Merged dataset used for training
    ├── models
    │   └── xgb_kwh.joblib             # Trained XGBoost model
    ├── src
    │   ├── simulate_iot.py            # IoT data generator
    │   ├── simulate_weather.py        # Weather data generator
    │   ├── features.py                # Merging and feature creation
    │   ├── train.py                   # Model training and evaluation
    │   └── forecast_plotly.py         # Interactive visualization (Plotly)
    ├── venv\                          # Virtual environment
    ├── README.md                      # Project documentation
    └── forecast_interactive.html     # Saved interactive dashboard

    The final result is a small yet complete prototype of a smart energy management system. With a few adjustments (real IoT data, a weather API, and live retraining), this same setup could power a real “AI-based home energy advisor.”

    It doesn’t just predict — it can help decide when it’s cheaper or smarter to use energy, saving both cost and power.

    Reflection

    This project turned out to be an amazing hands-on way to combine data simulation, feature engineering, model training, and visualization in one workflow.

    Every part of it has a clear role:

    • simulate_iot.py → creates realistic energy signals
    • simulate_weather.py → adds environmental context
    • features.py → merges and engineers predictors
    • train.py → builds and evaluates the model
    • forecast_plotly.py → brings everything to life visually
  17. Drug Response Prediction Dataset

    • kaggle.com
    zip
    Updated Dec 16, 2023
    Cite
    Vahid Kazemian (2023). Drug Response Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/vahidkazemian/introds
    Explore at:
    Available download formats: zip (2564 bytes)
    Dataset updated
    Dec 16, 2023
    Authors
    Vahid Kazemian
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains clinical and demographic patient data used to predict the most suitable drug prescription (multiclass classification task). It's designed for developing machine learning models that assist in personalized medicine.

    Origin & Purpose:
    • Source: Synthetic/benchmark dataset (commonly used in ML courses)
    • Goal: Predict one of 5 drugs (A, B, C, X, or Y) based on patient metrics
    • Size: 200 patient records

    Notable Characteristics:
    • Class imbalance: drugY = 39.5% (most frequent); drugA/B/C = 10-11% each
    • Clinical relevance: blood pressure (BP) and cholesterol levels heavily influence drug choice; electrolytes (Na/K) show non-linear relationships with outcomes

    Use Cases:
    • Multiclass classification practice
    • Feature importance analysis (e.g., "Does age or BP matter more?")
    • Medical decision-support prototyping

    Sample Insight: Patients with LOW BP and HIGH cholesterol are often prescribed drugC, while those with NORMAL vitals typically receive drugX or drugY.

    Ideal For: Logistic Regression, Random Forests, Gradient Boosting (XGBoost/CatBoost), and neural networks for tabular data. A baseline sketch follows below.
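
    A hedged multiclass baseline for this task; the file and column names (e.g., "drug200.csv", Drug) follow the classic 200-record drug dataset this resembles and are assumptions here:

    # Hedged sketch: 5-class drug prediction with XGBoost.
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBClassifier

    df = pd.read_csv("drug200.csv")                      # assumed file name
    y = LabelEncoder().fit_transform(df["Drug"])         # drugA..drugY -> 0..4
    X = pd.get_dummies(df.drop(columns=["Drug"]))        # one-hot categoricals

    model = XGBClassifier(objective="multi:softprob")
    model.fit(X, y)
    print(model.predict_proba(X.iloc[:3]))               # per-drug probabilities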

  18. Fake News Detection Dataset

    • kaggle.com
    zip
    Updated Apr 27, 2025
    Cite
    Mahdi Mashayekhi (2025). Fake News Detection Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/fake-news-detection-dataset
    Explore at:
    Available download formats: zip (11735585 bytes)
    Dataset updated
    Apr 27, 2025
    Authors
    Mahdi Mashayekhi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📚 Fake News Detection Dataset

    Overview

    This dataset is designed for practicing fake news detection using machine learning and natural language processing (NLP) techniques. It includes a rich collection of 20,000 news articles, carefully generated to simulate real-world data scenarios. Each record contains metadata about the article and a label indicating whether the news is real or fake.

    The dataset also intentionally includes around 5% missing values in some fields to simulate the challenges of handling incomplete data in real-life projects.

    Columns Description

    • title: A short headline summarizing the article (around 6 words).
    • text: The body of the news article (200–300 words on average).
    • date: The publication date of the article, randomly selected over the past 3 years.
    • source: The media source that published the article (e.g., BBC, CNN, Al Jazeera). May contain missing values (~5%).
    • author: The author's full name. Some entries are missing (~5%) to simulate real-world incomplete data.
    • category: The general category of the article (e.g., Politics, Health, Sports, Technology).
    • label: The target label: real or fake news.

    Why Use This Dataset?

    Fake News Detection Practice: Perfect for binary classification tasks.

    NLP Preprocessing: Allows users to practice text cleaning, tokenization, vectorization, etc.

    Handling Missing Data: Some fields are incomplete to simulate real-world data challenges.

    Feature Engineering: Encourages creating new features from text and metadata.

    Balanced Labels: Realistic distribution of real and fake news for fair model training.

    Potential Use Cases

    Building and evaluating text classification models (e.g., Logistic Regression, Random Forests, XGBoost).

    Practicing NLP techniques like TF-IDF, Word2Vec, BERT embeddings.

    Performing exploratory data analysis (EDA) on news data.

    Developing pipelines for dealing with missing values and feature extraction.
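
    A hedged sketch of the first use case, combining TF-IDF features with XGBoost (the label values "real"/"fake" are assumed from the column description; file name per File Info below):

    # Hedged sketch: TF-IDF + XGBoost for the real/fake label.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    df = pd.read_csv("fake_news_dataset.csv").dropna(subset=["text"])
    y = (df["label"] == "fake").astype(int)              # label values assumed

    vec = TfidfVectorizer(max_features=20000, stop_words="english")
    X = vec.fit_transform(df["text"])                    # sparse term matrix
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = XGBClassifier(n_estimators=300).fit(X_tr, y_tr)
    print("Test accuracy:", clf.score(X_te, y_te))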

    A Note on Data Quality

    This dataset has been synthetically generated to closely resemble real news articles. The diversity in titles, text, sources, and categories ensures that models trained on this dataset can generalize well to unseen, real-world data. However, since it is synthetic, it should not be used for production models or decision-making without careful validation.

    File Info

    Filename: fake_news_dataset.csv

    Size: 20,000 rows × 7 columns

    Missing Data: ~5% missing values in the source and author columns.

  19. Data from: Strategy of Coupling Artificial Intelligence with Thermodynamic Mechanism for Predicting Complex Polymer Viscosities

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Mar 6, 2024
    Cite
    Siqi Wang; Gabriele Sadowski; Yuanhui Ji (2024). Strategy of Coupling Artificial Intelligence with Thermodynamic Mechanism for Predicting Complex Polymer Viscosities [Dataset]. http://doi.org/10.1021/acssuschemeng.3c08185.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    ACS Publications
    Authors
    Siqi Wang; Gabriele Sadowski; Yuanhui Ji
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    With the environmental-protection requirements brought about by the large-scale industrial application of polymers, understanding polymer viscosities is becoming increasingly important. The different chain arrangements and degrees of crystallinity of polymers make their viscosities difficult to calculate. To address this challenge, new strategies based on artificial intelligence algorithms are proposed. First, three algorithms [extreme gradient boosting (XGBoost), a convolutional neural network (CNN), and a multilayer perceptron (MLP)] are trained on molecular descriptors of polymer molecular properties. Next, PC-SAFT parameters representing the polymers' thermodynamic properties are supplied to the XGBoost and CNN algorithms as additional molecular descriptors to improve prediction accuracy. Subsequently, Molecular ACCess System (MACCS) chemical fingerprints are combined with the XGBoost and CNN algorithms to further improve viscosity prediction. XGBoost was identified as the best algorithm for predicting polymer viscosities in different states. This finding is expected to provide useful guidance for screening polymers for applications in medicine and the chemical industry.
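
    The paper's code is not part of this record; below is a hedged sketch of the MACCS-fingerprint arm of the strategy, with placeholder monomer SMILES and viscosities (the PC-SAFT descriptors the paper also feeds in are omitted):

    # Hedged sketch: MACCS fingerprints as XGBoost inputs (requires RDKit).
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import MACCSkeys
    from xgboost import XGBRegressor

    smiles = ["C=C", "C=CC=C", "CC(=O)OC=C"]             # placeholder monomers
    viscosity = np.array([1.2, 3.4, 2.1])                # placeholder targets

    fps = np.array([list(MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)))
                    for s in smiles])                    # 167-bit MACCS keys

    model = XGBRegressor(n_estimators=100).fit(fps, viscosity)
    print(model.predict(fps))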

  20. Sustainable Luxury Consumer Survey

    • kaggle.com
    zip
    Updated Nov 6, 2025
    Cite
    Saima Khan6 (2025). Sustainable Luxury Consumer Survey [Dataset]. https://www.kaggle.com/datasets/saimakhan6/sustainable-luxury-consumer-survey
    Explore at:
    Available download formats: zip (102395 bytes)
    Dataset updated
    Nov 6, 2025
    Authors
    Saima Khan6
    Description

    📖 Description

    This dataset was designed to explore how psychological traits, attitudinal measures, and motivational drivers influence sustainable luxury purchase intention. It combines quantitative responses from 500 consumers on multiple validated scales, including sustainability attitudes, purchase intention, personality (Big Five), and motivational factors.

    The dataset was collected as part of an academic-industry research project on sustainable luxury consumption and consumer psychology. It aims to bridge the gap between marketing theory and predictive analytics by providing a structured, research-grade dataset suitable for both statistical and machine learning modeling.

    🎯 Business & Research Use Cases
    • Predict purchase intention for sustainable luxury products.
    • Segment consumers based on eco-conscious attitudes and personality traits.
    • Build marketing analytics models that link sustainability values with buying behavior.
    • Use for teaching and demonstration in data-driven marketing, consumer analytics, or ethical branding.
