24 datasets found
  1. Fit statistics for scored XGBoost models with 50,000 rows per dataset

    • figshare.com
    • plos.figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Fit statistics for scored XGBoost models with 50,000 rows per dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fit statistics for scored XGBoost models with 50,000 rows per dataset.

  2. UAE Used Cars Analysis - Full Project v1.0

    • kaggle.com
    zip
    Updated Mar 10, 2025
    Cite
    mohamed saad 254 (2025). UAE Used Cars Analysis - Full Project v1.0 [Dataset]. https://www.kaggle.com/datasets/mohamedsaad254/uae-used-cars-analysis-full-project-v1-0/code
    Explore at:
    Available download formats: zip (17351496 bytes)
    Dataset updated
    Mar 10, 2025
    Authors
    mohamed saad 254
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    United Arab Emirates
    Description

    UAE Used Cars Analysis - Full Project v1.0

    A Complete End-to-End Solution for Analyzing and Predicting Used Car Prices in the UAE

    Overview

    This dataset provides a compressed ZIP archive of the "UAE Used Cars Analysis" project, featuring 10,000 used car listings with precise location data (covering cities like Dubai, Abu Dhabi, and Sharjah), source code for a Dash-based web application, and a trained XGBoost model. It is designed for data scientists, analysts, and automotive enthusiasts to explore regional market trends, predict car prices, and visualize geospatial insights.

    Contents

    • data/uae_used_cars_10k.csv: Dataset with 10,000 records, including a Location column (e.g., Dubai, Abu Dhabi, Sharjah).
    • models/:
      • stacking_model.pkl: Trained XGBoost model.
      • scaler.pkl: Preprocessing scaler.
      • models.py: Model-related functions.
    • app.py: Main Dash application file.
    • callbacks.py: Interactive callbacks for the dashboard.
    • layouts.py: UI layout definitions.
    • train_model.py: Model training script.
    • utils.py: Utility functions.
    • requirements.txt: Required Python libraries.
    • README.md: Project documentation.

    Usage Instructions

    1. Download: Extract the ZIP using WinRAR or 7-Zip.
    2. Setup: Install dependencies: pip install -r requirements.txt.
    3. Run: Execute python app.py and access the app at http://127.0.0.1:8050/.
    4. Explore: Analyze data by location, predict prices, or train models locally.

    Dataset Details

    The dataset (uae_used_cars_10k.csv) includes:

    • Make: Car brand (e.g., Toyota).
    • Model: Car model (e.g., Camry).
    • Year: Manufacturing year.
    • Mileage: Distance driven in miles.
    • Cylinders: Number of engine cylinders.
    • Price: Sale price in AED.
    • Transmission: Automatic or Manual.
    • Fuel Type: Petrol, Diesel, etc.
    • Color: Exterior color.
    • Description: Seller's description.
    • Location: City of sale (e.g., Dubai, Abu Dhabi, Sharjah).

    Notes

    • Numeric columns (Mileage, Cylinders) may contain missing values; imputation recommended.
    • Mileage is in miles; convert to kilometers if needed (Mileage_Km = Mileage * 1.60934).
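
    A minimal pandas sketch covering both notes, assuming the column names above (median imputation is one reasonable choice, not the project's prescribed method):

    import pandas as pd

    df = pd.read_csv("data/uae_used_cars_10k.csv")

    # Impute missing numeric values with the column median
    for col in ["Mileage", "Cylinders"]:
        df[col] = df[col].fillna(df[col].median())

    # Convert mileage from miles to kilometers
    df["Mileage_Km"] = df["Mileage"] * 1.60934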

    Prerequisites

    • Python 3.8+
    • Libraries in requirements.txt (e.g., Dash, XGBoost, dash-leaflet for maps).

    Source

    Data aggregated from UAE car platforms in March 2025.

    Related Resources

    Last Updated: March 9, 2025 | Version 1.0 | Author: Mohammed Saad

  3. XGBoost (2.0.3) whl

    • kaggle.com
    zip
    Updated Jan 4, 2024
    Cite
    Carl McBride Ellis (2024). XGBoost (2.0.3) whl [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/xgboost-2-0-0-whl/code
    Explore at:
    Available download formats: zip (1187991597 bytes)
    Dataset updated
    Jan 4, 2024
    Authors
    Carl McBride Ellis
    Description

    This is the whl file for XGBoost version 2.0.0 (released 12th September 2023)
    Update: This is the whl file for XGBoost version 2.0.2 (released 13th November 2023)
    Update: This is the whl file for XGBoost version 2.0.3 (released 19th December 2023)

    Installation: attach this dataset to one's notebook, then:

    !pip install -q /kaggle/input/xgboost-2-0-0-whl/xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl
    
    import xgboost as xgb
    
    # Confirm the installed version
    xgb.__version__
    

    License: Apache Software License (Apache-2.0)

  4. Performance results for the XGBoost calibrated model using clinical variables and chief complaint against the reference model (triage priority) and respective hyperparameters

    • datasetcatalog.nlm.nih.gov
    Updated Apr 2, 2020
    Cite
    Palos, Carlos; Fernandes, Marta; Leite, Francisca; Celi, Leo Anthony; Horng, Steven; Vieira, Susana M.; Johnson, Alistair; Mendes, Rúben; Finkelstein, Stan (2020). Performance results for the XGBoost calibrated model using clinical variables and chief complaint against the reference model (triage priority) and respective hyperparameters. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000500932
    Explore at:
    Dataset updated
    Apr 2, 2020
    Authors
    Palos, Carlos; Fernandes, Marta; Leite, Francisca; Celi, Leo Anthony; Horng, Steven; Vieira, Susana M.; Johnson, Alistair; Mendes, Rúben; Finkelstein, Stan
    Description

    AUROC was the performance measure used for hyperparameter tuning and best-model selection on the training set. Hyperparameters not mentioned in the table were left at their defaults in XGBClassifier (Python 3.7).

  5. Python script for training machine learning models for glass density prediction

    • agh.rodbuk.pl
    Updated Jul 29, 2025
    Cite
    Paweł Stoch; Paweł Stoch (2025). Python script for training machine learning models for glass density prediction [Dataset]. http://doi.org/10.58032/AGH/WY0GEJ
    Explore at:
    Available download formats: application/x-ipynb+json (123454), txt (1056), zip (2747711), application/x-ipynb+json (130989), application/x-ipynb+json (2224507), application/x-ipynb+json (1194193)
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    AGH University of Krakow
    Authors
    Paweł Stoch; Paweł Stoch
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Accurately predicting glass density is crucial for designing novel materials. This study aims to develop a robust predictive model for the density of oxide glasses and, more importantly, to investigate how physically-informed feature engineering can create accurate and interpretable models that reveal underlying physical principles. Using a dataset of 76,593 oxide glasses from the SciGlass database, three ML models (ElasticNet, XGBoost, MLP) were trained and evaluated. Four distinct feature sets were constructed with increasing physical complexity, ranging from simple elemental composition to the advanced Magpie descriptors. The best model was further analyzed for interpretability using feature importance and SHAP analysis. A clear hierarchical improvement in predictive accuracy was observed with increasing feature sophistication across all models. The XGBoost model combined with the Magpie feature set provided the best performance, achieving a coefficient of determination (R²) of 0.97. Interpretability analysis revealed that the model's predictions were overwhelmingly driven by physical attributes, with mean atomic weight being the most influential predictor. The model learns to approximate the fundamental density equation using mean atomic weight as a proxy for molar mass and electronic structure features to estimate molar volume. This demonstrates that a data-driven approach can function as a scientifically valid and interpretable tool, accelerating the discovery of new materials.
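
    A minimal sketch of the modeling-and-interpretation pipeline described above, with stand-in data in place of the SciGlass/Magpie features (names and hyperparameters are illustrative, not the authors' script):

    import shap
    import xgboost as xgb
    from sklearn.datasets import make_regression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # Stand-in for 76,593 SciGlass compositions featurized with Magpie descriptors
    X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, y_train)
    print("R2:", r2_score(y_test, model.predict(X_test)))

    # Interpretability: tree SHAP values and a global importance summary
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test)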

  6. Data from: Representative sample size for estimating saturated hydraulic conductivity via machine learning

    • search.dataone.org
    • hydroshare.org
    Updated May 25, 2024
    Cite
    Amin Ahmadisharaf; Reza Nematirad; Sadra Sabouri; Yakov Pachepsky; Behzad Ghanbarian (2024). Representative sample size for estimating saturated hydraulic conductivity via machine learning [Dataset]. https://search.dataone.org/view/sha256%3A1a7d2a59141f58fa9b927ab55cd6ad737474b2eb4419a6c568223c903760d00e
    Explore at:
    Dataset updated
    May 25, 2024
    Dataset provided by
    Hydroshare
    Authors
    Amin Ahmadisharaf; Reza Nematirad; Sadra Sabouri; Yakov Pachepsky; Behzad Ghanbarian
    Description

    This database includes saturated hydraulic conductivity data from the USKSAT database, as well as the associated Python code used to analyze learning curves and to train and test the developed machine learning models.

  7. Dataset for 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water 2022

    • data.nceas.ucsb.edu
    • search.dataone.org
    Updated Aug 8, 2023
    Cite
    Helen Weierbach; Aranildo R. Lima; Jared D. Willard; Valerie C. Hendrix; Danielle S. Christianson; Misha Lubich; Charuleka Varadharajan (2023). Dataset for 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water 2022 [Dataset]. http://doi.org/10.15485/1854257
    Explore at:
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Helen Weierbach; Aranildo R. Lima; Jared D. Willard; Valerie C. Hendrix; Danielle S. Christianson; Misha Lubich; Charuleka Varadharajan
    Time period covered
    Jan 1, 1980 - Jun 30, 2021
    Area covered
    Description

    This data package presents forcing data, model code, and model output for classical machine learning models that predict monthly stream water temperature, as presented in the manuscript ‘Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning’, Water (Weierbach et al., 2022). For input forcing datasets, we include two files each generated using the BASIN-3D data integration tool (Varadharajan et al., 2022) for stations in the Pacific Northwest and Mid-Atlantic hydrologic regions. Model code (written in Python using Jupyter notebooks) includes code for data preprocessing; training Multiple Linear Regression, Support Vector Regression, and Extreme Gradient Boosted Tree models; and additional notebooks for analysis of model output. We also include, in HDF5 format, the specific model output files that represent the modeling configurations presented in the manuscript. Together, these data make up the workflow for predictions across three scenarios (single station, regional, and predictions in unmonitored basins) presented in the manuscript and allow for reproducibility of the modeling procedures.

  8. Friction coefficient data of open-cell AlSi10Mg and AlSi10Mg-Al2O3 materials with different pore sizes by pin-on-disk test and machine learning prediction

    • data.mendeley.com
    Updated May 30, 2023
    Cite
    Mihail Kolev (2023). Friction coefficient data of open-cell AlSi10Mg and AlSi10Mg-Al2O3 materials with different pore sizes by pin-on-disk test and machine learning prediction [Dataset]. http://doi.org/10.17632/2356m76ktj.1
    Explore at:
    Dataset updated
    May 30, 2023
    Authors
    Mihail Kolev
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data folders and files are organized in the repository as follows:

    The Pin-on-disk_data folder contains three subfolders (Raw_files; Processed_files; COF_calculation):
    • Raw_files: 12 DWF files with the raw data for each specimen.
    • Processed_files: 12 XLSX files with the processed data for each specimen.
    • COF_calculation: eight XLSX files with the average COF and time data for each material, two PNG files with the plots of the average COF vs time for each material pair, and one Python script for calculating and visualizing the average COF and time data.

    The Prediction folder contains three subfolders (Input_files; Output_data; Python_COF_prediction):
    • Input_files: four XLSX files with the input data for the Python script to make the predictions of COF vs sliding time for each material.
    • Output_data: eight XLSX files with the actual and predicted values of COF for two different sets (test and validation) of each material, four TXT files with the performance metrics of the predicted COF for each material, and four PNG files with the plots of the actual vs predicted COF as a function of time for each material.
    • Python_COF_prediction: one Python script for making and evaluating the predictions of COF vs sliding time using an XGBoost model.

    The data were collected by performing dry wear tests at room temperature with a linear velocity of 0.5 m∙s−1, a load of 50 N, and a sliding time of 420 s. The specimen labels indicate the following:

    • AC_3_2, AC_3_3, AC_3_4 are three datasets used for pin-on-disk tests conducted with open-cell AlSi10Mg-Al2O3 composite with pore size of 800 ÷ 1000 μm (AC);
    • C_5_1, C_5_2, C_5_3 are three datasets used for pin-on-disk tests conducted with open-cell AlSi10Mg material with pore size of 800 ÷ 1000 μm (C);
    • AE_3_2, AE_4_1, AE_6_6 are three datasets used for pin-on-disk tests conducted with open-cell AlSi10Mg-Al2O3 composite with pore size of 1000 ÷ 1200 μm (AE);
    • E_3_1, E_6, E_6_3 are three datasets used for pin-on-disk tests conducted with open-cell AlSi10Mg material with pore size of 1000 ÷ 1200 μm (E).

  9. Fraudulent Financial Transaction Prediction

    • kaggle.com
    zip
    Updated Feb 15, 2025
    Cite
    Younus_Mohamed (2025). Fraudulent Financial Transaction Prediction [Dataset]. https://www.kaggle.com/datasets/younusmohamed/fraudulent-financial-transaction-prediction
    Explore at:
    Available download formats: zip (41695207 bytes)
    Dataset updated
    Feb 15, 2025
    Authors
    Younus_Mohamed
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Fraud Detection with Imbalanced Data

    Overview
    This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.

    Key Points
    - Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraud.
    - Multiple Feature Files: Combine them by matching on id or Group.
    - Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
    - Goal: Predict which transactions in test_share.csv might be fraudulent.

    Files in this Dataset

    1. train.csv

      • Rows: 227,845 (example size)
      • Columns: 28
      • Description: Contains historical transaction data for training a fraud detection model.
      • Important: The Target column (0 = Clean, 1 = Fraud).
    2. test_share.csv

      • Rows: 56,962 (example size)
      • Columns: 27
      • Description: Test dataset, with the same structure as train.csv but without the Target column.
    3. Geo_scores.csv

      • Columns: (id, geo_score)
      • Description: Location-based geospatial scores for each transaction.
    4. Lambda_wts.csv

      • Columns: (Group, lambda_wt)
      • Description: Proprietary “lambda” weights associated with each Group.
    5. Qset_tats.csv

      • Columns: (id, qsets_normalized_tat)
      • Description: Network turn-around times (TAT) for each transaction.
    6. instance_scores.csv

      • Columns: (id, instance_scores)
      • Description: Vulnerability or risk qualification scores for each transaction.

    Suggested Usage

    1. Load all CSVs into dataframes.
    2. Merge additional files (Geo_scores.csv, Lambda_wts.csv, etc.) by matching id or Group.
    3. Explore the severe class imbalance in train.csv (Target ~1% is fraud).
    4. Train any suitable classification model (Random Forest, XGBoost, etc.) on train.csv.
    5. Predict on test_share.csv or your own external data.

    Possible Tools:
    - Python: pandas, NumPy, scikit-learn
    - Imbalance Handling: SMOTE, Random Oversampler, or class weights
    - Metrics: Precision, Recall, F1-score, ROC-AUC, etc.

    Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!
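
    A minimal merge-and-train sketch for the suggested usage above, using the file and column names listed (the class-weight choice is one simple imbalance strategy, not a prescription):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    train = pd.read_csv("train.csv")
    geo = pd.read_csv("Geo_scores.csv")
    lam = pd.read_csv("Lambda_wts.csv")
    tat = pd.read_csv("Qset_tats.csv")
    inst = pd.read_csv("instance_scores.csv")

    # Merge auxiliary features: id-keyed files first, then Group-keyed lambda weights
    df = (train.merge(geo, on="id", how="left")
               .merge(tat, on="id", how="left")
               .merge(inst, on="id", how="left")
               .merge(lam, on="Group", how="left"))

    X = df.drop(columns=["id", "Target"])  # assumes the remaining columns are numeric
    y = df["Target"]
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

    # class_weight="balanced" compensates for the ~1% fraud rate
    clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_val, clf.predict(X_val)))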

    Tags

    • fraud-detection
    • classification
    • imbalanced-data
    • financial-transactions
    • machine-learning
    • python
    • beginner-friendly

    License: CC BY-NC-SA 4.0

  10. Software Defects 1k Dataset

    • cubig.ai
    zip
    Updated Jun 30, 2025
    Cite
    CUBIG (2025). Software Defects 1k Dataset [Dataset]. https://cubig.ai/store/products/536/software-defects-1k-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction

    • The Software Defects Dataset 1k contains 1,000 synthetic code functions written in seven programming languages, including Python, Java, JavaScript, C++, Go, and Rust, each labeled as buggy or clean (1/0) for software defect prediction.

    2) Data Utilization

    (1) Characteristics of the Software Defects Dataset 1k:

    • The dataset provides the actual function source code, the programming language, and metrics such as lines_of_code and cyclomatic_complexity, as well as static analysis-based features such as AST token count, if/return/function-call counts (num_ifs/num_returns/num_func_calls), and AST node count (ast_nodes).
    • For Python code, fine-grained abstract syntax tree (AST)-based analysis is included; other languages fall back to token-based analysis.

    (2) The Software Defects Dataset 1k can be used for:

    • Defect prediction modeling: development and evaluation of code defect prediction models using traditional ML models (Random Forest, XGBoost) or LLMs (CodeT5, GPT-4).
    • Cross-lingual analysis: studying cross-lingual defect patterns by comparing AST tokens, control statement patterns, etc., across multilingual codebases.
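
    A minimal defect-prediction sketch along those lines, assuming a CSV export whose columns follow the features described above (file and label names hypothetical):

    import pandas as pd
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Hypothetical file name; columns follow the features described above
    df = pd.read_csv("software_defects_1k.csv")
    features = ["lines_of_code", "cyclomatic_complexity", "token_count",
                "num_ifs", "num_returns", "num_func_calls", "ast_nodes"]

    X_tr, X_te, y_tr, y_te = train_test_split(
        df[features], df["defect"], stratify=df["defect"], random_state=42)

    clf = XGBClassifier(n_estimators=200, max_depth=4)
    clf.fit(X_tr, y_tr)
    print("F1:", f1_score(y_te, clf.predict(X_te)))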

  11. Output datasets from ML–assisted bibliometric workflow in African phytochemical metabolomics research

    • figshare.com
    zip
    Updated Oct 19, 2025
    Cite
    Temitope Omogbene; Fikisiwe Gebashe; Ibraheem Lawal; Stephen Amoo; Adeyemi O. Aremu (2025). Output datasets from ML–assisted bibliometric workflow in African phytochemical metabolomics research [Dataset]. http://doi.org/10.6084/m9.figshare.30396481.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 19, 2025
    Dataset provided by
    figshare
    Authors
    Temitope Omogbene; Fikisiwe Gebashe; Ibraheem Lawal; Stephen Amoo; Adeyemi O. Aremu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection contains supplementary datasets generated during the machine learning–assisted bibliometric workflow for metabolomics and phytochemical research. The datasets represent sequential outputs derived from the integration and harmonisation of bibliographic metadata from Scopus, Web of Science (WoS), and Dimensions, processed via R and Python environments. The datasets were produced through distinct workflow stages:

    • Dataset 1A (merged_dataset2.xlsx): Consolidated metadata produced in R from the merged raw bibliographic exports of Scopus, WoS, and Dimensions.
    • Dataset 1B (sampled_data.xlsx): A stratified random sample generated in Python for pretraining and manual annotation.
    • Dataset 1C (sample_data_pretrained.xlsx): Annotated sample dataset manually screened according to inclusion and exclusion criteria.
    • Dataset 1D (highlighted_full_data_with_predictions.xlsx): The complete harmonised dataset automatically classified using the trained XGBoost model.
    • Dataset 1E (absolute_metabolomics_data.xlsx): Final curated dataset of relevant records extracted from the ML-filtered corpus.

    Importantly, the file names of each dataset presented here were renamed from their original Google Drive file paths (referenced in the Python Google Colab scripts) to ensure sequential, descriptive, and logically ordered naming. This adjustment enhances clarity, reproducibility, and cross-reference consistency across all linked repositories.

  12. Preventive Maintenance for Marine Engines

    • kaggle.com
    zip
    Updated Feb 12, 2025
    Cite
    Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
    Explore at:
    Available download formats: zip (436025 bytes)
    Dataset updated
    Feb 12, 2025
    Authors
    Fijabi J. Adekunle
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Preventive Maintenance for Marine Engines: Data-Driven Insights

    Introduction:

    Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

    Overview

    This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

    Key steps include:

    1. Data Simulation: Creating a realistic dataset with engine performance metrics.
    2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
    3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
    4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
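
    A minimal sketch of step 4, with stand-in data in place of the simulated engine dataset (the parameter grid is illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Stand-in for the simulated engine data: 3 maintenance classes
    X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                               n_classes=3, random_state=42)

    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5],
    }

    # 5-fold cross-validated grid search
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=5, scoring="accuracy", n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)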

    Tools Used

    1. Python: Data processing, analysis and modeling
    2. Pandas & NumPy: Data manipulation
    3. Scikit-Learn & XGBoost: Machine learning model training
    4. Matplotlib & Seaborn: Data visualization

    Skills Demonstrated

    ✔ Data Simulation & Preprocessing
    ✔ Exploratory Data Analysis (EDA)
    ✔ Feature Engineering & Encoding
    ✔ Supervised Machine Learning (Classification)
    ✔ Model Evaluation & Hyperparameter Tuning

    Key Insights & Findings

    📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
    📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
    📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training.
    📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

    Challenges Faced

    🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
    🚧 Model Performance: The accuracy was limited (~35%) due to the complexity of failure prediction.
    🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

    Call to Action

    🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
    📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
    🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.

  13. Data from: Simulated wildfire burned area over the CONUS during 2001-2020

    • osti.gov
    Updated Jul 30, 2024
    Cite
    Huang, Huilin; Liu, Ye (2024). Simulated wildfire burned area over the CONUS during 2001-2020 [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/2424127
    Explore at:
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Pacific Northwest National Laboratory
    DOE
    Authors
    Huang, Huilin; Liu, Ye
    Description

    Wildfires have shown increasing trends in both frequency and severity across the Contiguous United States (CONUS). However, process-based fire models have difficulties in accurately simulating the burned area over the CONUS due to a simplification of the physical process and cannot capture the interplay among fire, ignition, climate, and human activities. The deficiency of burned area simulation deteriorates the description of fire impact on energy balance, water budget, and carbon fluxes in the Earth System Models (ESMs). Alternatively, machine learning (ML) based fire models, which capture statistical relationships between the burned area and environmental factors, have shown promising burned area predictions and corresponding fire impact simulation. We develop a hybrid framework (ML4Fire-XGB) that integrates a pretrained eXtreme Gradient Boosting (XGBoost) wildfire model with the Energy Exascale Earth System Model (E3SM) land model (ELM). A Fortran-C-Python deep learning bridge is adapted to support online communication between ELM and the ML fire model. Specifically, the burned area predicted by the ML-based wildfire model is directly passed to ELM to adjust the carbon pool and vegetation dynamics after disturbance, which are then used as predictors in the ML-based fire model in the next time step. Evaluated against the historical burned area from the Global Fire Emissions Database 5 from 2001-2020, the ML4Fire-XGB model outperforms process-based fire models in terms of spatial distribution and seasonal variations. Sensitivity analysis confirms that the ML4Fire-XGB well captures the responses of the burned area to rising temperatures. The ML4Fire-XGB model has proved to be a new tool for studying vegetation-fire interactions, and more importantly, enables seamless exploration of climate-fire feedback, working as an active component in E3SM.

  14. faiss_whl

    • kaggle.com
    zip
    Updated Oct 20, 2025
    Cite
    Pierre Tisseur (2025). faiss_whl [Dataset]. https://www.kaggle.com/datasets/pierretisseur/faiss-whl
    Explore at:
    Available download formats: zip (63481896 bytes)
    Dataset updated
    Oct 20, 2025
    Authors
    Pierre Tisseur
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wheel files for offline use in notebooks:

    • Faiss (CPU)
    • XGBoost
    • Imbalance-XGBoost
    • Optuna

  15. Heart Disease Risk Prediction Dataset

    • kaggle.com
    zip
    Updated Feb 7, 2025
    Cite
    Mahatir Ahmed Tusher (2025). Heart Disease Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/mahatiratusher/heart-disease-risk-prediction-dataset
    Explore at:
    Available download formats: zip (1448235 bytes)
    Dataset updated
    Feb 7, 2025
    Authors
    Mahatir Ahmed Tusher
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Heart Disease Risk Prediction Dataset

    Overview

    This synthetic dataset is designed to predict the risk of heart disease based on a combination of symptoms, lifestyle factors, and medical history. Each row in the dataset represents a patient, with binary (Yes/No) indicators for symptoms and risk factors, along with a computed risk label indicating whether the patient is at high or low risk of developing heart disease.

    The dataset contains 70,000 samples, making it suitable for training machine learning models for classification tasks. The goal is to provide researchers, data scientists, and healthcare professionals with a clean and structured dataset to explore predictive modeling for cardiovascular health.

    This dataset is a side project of EarlyMed, developed by students of Vellore Institute of Technology (VIT-AP). EarlyMed aims to leverage data science and machine learning for early detection and prevention of chronic diseases.

    Dataset Features

    Input Features

    Symptoms (Binary - Yes/No)

    1. Chest Pain (chest_pain): Presence of chest pain, a common symptom of heart disease.
    2. Shortness of Breath (shortness_of_breath): Difficulty breathing, often associated with heart conditions.
    3. Unexplained Fatigue (fatigue): Persistent tiredness without an obvious cause.
    4. Palpitations (palpitations): Irregular or rapid heartbeat.
    5. Dizziness/Fainting (dizziness): Episodes of lightheadedness or fainting.
    6. Swelling in Legs/Ankles (swelling): Swelling due to fluid retention, often linked to heart failure.
    7. Pain in Arm/Jaw/Neck/Back (radiating_pain): Radiating pain, a hallmark of angina or heart attacks.
    8. Cold Sweats & Nausea (cold_sweats): Symptoms commonly associated with acute cardiac events.

    Risk Factors (Binary - Yes/No or Continuous)

    1. Age (age): Patient's age in years (continuous variable).
    2. High Blood Pressure (hypertension): History of hypertension (Yes/No).
    3. High Cholesterol (cholesterol_high): Elevated cholesterol levels (Yes/No).
    4. Diabetes (diabetes): Diagnosis of diabetes (Yes/No).
    5. Smoking History (smoker): Whether the patient is a smoker (Yes/No).
    6. Obesity (obesity): Obesity status (Yes/No).
    7. Family History of Heart Disease (family_history): Family history of cardiovascular conditions (Yes/No).

    Output Label

    • Heart Disease Risk (risk_label): Binary label indicating the risk of heart disease:
      • 0: Low risk
      • 1: High risk

    Data Generation Process

    This dataset was synthetically generated using Python libraries such as numpy and pandas. The generation process ensured a balanced distribution of high-risk and low-risk cases while maintaining realistic correlations between features. For example:

    • Patients with multiple risk factors (e.g., smoking, hypertension, and diabetes) were more likely to be labeled as high risk.
    • Symptom patterns were modeled after clinical guidelines and research studies on heart disease.
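
    A minimal numpy/pandas sketch in the spirit of that process (the rule and probabilities below are illustrative, not the dataset's actual generator):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 70_000

    df = pd.DataFrame({
        "age": rng.integers(18, 90, n),
        "hypertension": rng.integers(0, 2, n),
        "diabetes": rng.integers(0, 2, n),
        "smoker": rng.integers(0, 2, n),
        "chest_pain": rng.integers(0, 2, n),
    })

    # Illustrative rule: more risk factors raise the probability of the high-risk label
    score = df[["hypertension", "diabetes", "smoker", "chest_pain"]].sum(axis=1) + (df["age"] > 60)
    p_high = (0.1 + 0.15 * score).clip(0, 0.95)
    df["risk_label"] = (rng.random(n) < p_high).astype(int)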

    Sources of Inspiration

    The design of this dataset was inspired by the following resources:

    Books

    • "Harrison's Principles of Internal Medicine" by J. Larry Jameson et al.: A comprehensive resource on cardiovascular diseases and their symptoms.
    • "Mayo Clinic Cardiology" by Joseph G. Murphy et al.: Provides insights into heart disease risk factors and diagnostic criteria.

    Research Papers

    • Framingham Heart Study: A landmark study identifying key risk factors for cardiovascular disease.
    • American Heart Association (AHA) Guidelines: Recommendations for diagnosing and managing heart disease.

    Existing Datasets

    • UCI Heart Disease Dataset: A widely used dataset for heart disease prediction.
    • Kaggle’s Heart Disease datasets: Various datasets contributed by the community.

    Clinical Guidelines

    • Centers for Disease Control and Prevention (CDC): Information on heart disease symptoms and risk factors.
    • World Health Organization (WHO): Global statistics and risk factor analysis for cardiovascular diseases.

    Applications

    This dataset can be used for a variety of purposes:

    1. Machine Learning Research:

      • Train classification models (e.g., Logistic Regression, Random Forest, XGBoost) to predict heart disease risk.
      • Experiment with feature engineering, model tuning, and evaluation metrics like Accuracy, Precision, Recall, and ROC-AUC.
    2. Healthcare Analytics:

      • Identify key risk factors contributing to heart disease.
      • Develop decision support systems for early detection of cardiovascular risks.
    3. Educational Purposes:

      • Teach students and practitioners about predictive modeling in healthcare.
      • Demonstrate the importance of feature selection...
  16. Software Defects Dataset 1k

    • kaggle.com
    zip
    Updated Jun 16, 2025
    Cite
    Ravikumar R N (2025). Software Defects Dataset 1k [Dataset]. https://www.kaggle.com/datasets/ravikumarrn/software-defects-dataset-1k
    Explore at:
    Available download formats: zip (8453 bytes)
    Dataset updated
    Jun 16, 2025
    Authors
    Ravikumar R N
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📦 Software Defects Multilingual Dataset with AST & Token Features

    This repository provides a dataset of 1,000 synthetic code functions across multiple programming languages for the purpose of software defect prediction, multilingual static analysis, and LLM evaluation.

    🙋 Citation

    If you use this dataset in your research or project, please cite it as:

    "Ravikumar R N, Software Defects Multilingual Dataset with AST Features (2025). Generated by synthetic methods for defect prediction and multilingual code analysis."

    🧠 Dataset Highlights

    • Languages Included: Python, Java, JavaScript, C, C++, Go, Rust
    • Records: 1,000 code snippets
    • Labels: defect (1 = buggy, 0 = clean)
    • Features:

      • token_count: Total tokens (AST-based for Python)
      • num_ifs, num_returns, num_func_calls: Code structure features
      • ast_nodes: Number of nodes in the abstract syntax tree (Python only)
      • lines_of_code & cyclomatic_complexity: Simulated metrics for modeling

    📊 Columns Description

    Column                 Description
    function_name          Unique identifier for the function
    code                   The actual function source code
    language               Programming language used
    lines_of_code          Approximate number of lines in the function
    cyclomatic_complexity  Simulated measure of decision complexity
    defect                 1 = buggy, 0 = clean
    token_count            Total token count (Python uses AST tokens)
    num_ifs                Count of 'if' statements
    num_returns            Count of 'return' statements
    num_func_calls         Number of function calls
    ast_nodes              AST node count (Python only, fallback = token count)
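
    A minimal sketch of how the Python-only AST features could be computed with the standard ast module (an illustration, not the dataset's actual generation code):

    import ast

    def ast_features(source: str) -> dict:
        """Count AST-derived features for one Python function."""
        tree = ast.parse(source)
        nodes = list(ast.walk(tree))
        return {
            "ast_nodes": len(nodes),
            "num_ifs": sum(isinstance(n, ast.If) for n in nodes),
            "num_returns": sum(isinstance(n, ast.Return) for n in nodes),
            "num_func_calls": sum(isinstance(n, ast.Call) for n in nodes),
        }

    print(ast_features("def f(x):\n    if x > 0:\n        return x\n    return -x"))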

    🛠️ Usage Examples

    This dataset is suitable for:

    • Training traditional ML models like Random Forests or XGBoost
    • Evaluating prompt-based or fine-tuned LLMs (e.g., CodeT5, GPT-4)
    • Feature importance studies using AST and static code metrics
    • Cross-lingual transfer learning in code understanding

    📎 License

    This dataset is synthetic and licensed under CC BY 4.0. Feel free to use, share, or adapt it with proper attribution.

  17. Spatiotemporal Soil Erosion Dataset for the Yarlung Tsangpo River Basin (1990–2100)

    • figshare.com
    zip
    Updated May 19, 2025
    Cite
    peng xin (2025). Spatiotemporal Soil Erosion Dataset for the Yarlung Tsangpo River Basin (1990–2100) [Dataset]. http://doi.org/10.6084/m9.figshare.29095763.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 19, 2025
    Dataset provided by
    figshare
    Authors
    peng xin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Yarlung Zangbo River
    Description

    This dataset includes historical erosion data (1990–2019), future soil erosion projections under SSP126, SSP245, and SSP585 scenarios (2021–2100), and predicted R and C factors for each period.

    Future R factors

    We incorporated 25 Global Climate Models (GCMs) from CMIP6 for calculating the future R factors, selected via the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) project (Table S3). The selection was based on the completeness of their time series and their alignment with the selected scenarios. Rainfall projections were corrected using quantile delta mapping (QDM) (Cannon et al., 2015) to address systematic biases in intensity distributions while preserving the projected trends in mean rainfall and extremes—critical for soil erosion analysis (Eekhout and de Vente, 2019). Bias correction was conducted using a 25-year baseline (1990–2014), with adjustments made monthly to correct for seasonal biases. The corrected bias functions were then applied to adjust the years (2020–2100) of daily rainfall data using the "ibicus" package, an open-source Python tool for bias adjustment and climate model evaluation. A minimum daily rainfall threshold of 0.1 mm was used to define rainy days, following established studies (Bulovic, 2024; Eekhout and de Vente, 2019; Switanek et al., 2017). Additionally, the study employed QDM to correct biases in historical GCM simulations, ensuring the applicability of the QDM method for rainfall bias correction in the YTRB. A baseline period of 1990–2010 was selected to establish the bias correction function, which was subsequently applied to adjust GCM simulations for 2011–2014. To evaluate the effectiveness of this calibration, we compared the annual mean precipitation from bias-corrected GCMs during 2011–2014 with observed precipitation data at the pixel level (Figs. S2, S3), using R² as the evaluation metric. The results showed a significant increase in R² after bias correction, confirming the effectiveness of the QDM approach.

    Future C factors

    To ensure the accuracy of the C factor predictions, we selected five CMIP6 climate models (Table S4) with high spatial resolution compared to other CMIP6 climate models. Of the five selected climate models, CanESM5, IPSL-CM6-LR, and MIROC-ES2L have high equilibrium climate sensitivity (ECS) values. The ECS is the expected long-term warming after a doubling of atmospheric CO2 concentrations, and it is one of the most important indicators for understanding the impact of future warming (Rao et al., 2023). Therefore, we selected these five climate models with ECS values >3.0 to capture the full range of potential climate-induced changes affecting soil erosion. After selecting the climate models, we constructed an XGBoost model using historical C factor data and bioclimatic variables from the WorldClim data portal. WorldClim provides global gridded datasets with a 1 km² spatial resolution, including 19 bioclimatic variables derived from monthly temperature and precipitation data, reflecting annual trends, seasonality, and extreme environmental conditions (Hijmans et al., 2005). However, strong collinearity among the 19 bioclimatic variables and an excessive number of input features may increase model complexity and reduce XGBoost's predictive accuracy. To optimize performance, we employed Recursive Feature Elimination (RFE), an iterative method for selecting the most relevant features while preserving prediction accuracy (Kornyo et al., 2023; Xiong et al., 2024). In each iteration, the current subset of features was used to train an XGBoost model, and feature importance was evaluated to remove the least significant variable, gradually refining the feature set. Using 80% of the data for training and 20% for testing, we employed 5-fold cross-validation to determine the feature subset that maximized the average R², ensuring optimal model performance. Additionally, a Genetic Algorithm (GA) was applied in each iteration to optimize the hyperparameters of the XGBoost model, which is crucial for enhancing both the efficiency and robustness of the model (Zhong and Liu, 2024; Zou et al., 2024). Finally, based on the variable selection results from RFE, the bioclimatic variables of future climate models were input into the trained XGBoost model to obtain the average C factor for the five selected climate models across four future periods (2020–2040, 2040–2060, 2060–2080, and 2080–2100).

    RUSLE model

    In this study, the mean annual soil loss was initially estimated using the RUSLE model, which enables us to estimate the spatial pattern of soil erosion (Renard et al., 1991). In areas where data are scarce, we consider RUSLE to be an effective runoff-dependent soil erosion model because it requires only limited data for the study area (Haile et al., 2012).
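
    A simplified sketch of the RFE-with-cross-validation selection described above, using scikit-learn's RFE for brevity rather than the authors' custom loop (stand-in data; the GA hyperparameter step is omitted):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBRegressor

    # Stand-in for the 19 bioclimatic predictors and the historical C factor target
    X, y = make_regression(n_samples=500, n_features=19, n_informative=8, random_state=0)

    best_score, best_n = -np.inf, None
    for n_features in range(4, 20, 3):
        rfe = RFE(XGBRegressor(n_estimators=200), n_features_to_select=n_features)
        Xs = rfe.fit_transform(X, y)
        # 5-fold CV on the selected subset; keep the size that maximizes mean R²
        score = cross_val_score(XGBRegressor(n_estimators=200), Xs, y,
                                cv=5, scoring="r2").mean()
        if score > best_score:
            best_score, best_n = score, n_features

    print(f"Best subset size: {best_n} (mean CV R^2 = {best_score:.3f})")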

  18. PythonLibraries|WheelFiles

    • kaggle.com
    zip
    Updated Mar 25, 2024
    Cite
    Ravi Ramakrishnan (2024). PythonLibraries|WheelFiles [Dataset]. https://www.kaggle.com/datasets/ravi20076/pythonlibrarieswheelfiles/code
    Explore at:
    Available download formats: zip (1556654809 bytes)
    Dataset updated
    Mar 25, 2024
    Authors
    Ravi Ramakrishnan
    License

    https://cdla.io/sharing-1-0/

    Description

    Hello all,
    This dataset is my humble attempt to allow myself and others to upgrade essential python packages to their latest versions. This dataset contains the .whl files of the below packages to be used across general kernels and especially in internet-off code challenges-

    Package                   Version(s)            Functionality
    AutoGluon                 1.0.0                 AutoML models
    Catboost                  1.2.2, 1.2.3          ML models
    Iterative-Stratification  0.1.7                 Iterative stratification for multi-label classifiers
    Joblib                    1.3.2                 File dumping and retrieval
    LAMA                      0.3.8b1               AutoML models
    LightGBM                  4.3.0, 4.2.0, 4.1.0   ML models
    MAPIE                     0.8.2                 Quantile regression
    Numpy                     1.26.3                Data wrangling
    Pandas                    2.1.4                 Data wrangling
    Polars                    0.20.3, 0.20.4        Data wrangling
    PyTorch                   2.0.1                 Neural networks
    PyTorch-TabNet            4.1.0                 Neural networks
    PyTorch-Forecast          0.7.0                 Neural networks
    Pygwalker                 0.3.20                Data wrangling and visualization
    Scikit-learn              1.3.2, 1.4.0          ML models / pipelines / data wrangling
    Scipy                     1.11.4                Data wrangling / statistics
    TabPFN                    10.1.9                ML models
    Torch-Frame               1.7.5                 Neural networks
    TorchVision               0.15.2                Neural networks
    XGBoost                   2.0.2, 2.0.1, 2.0.3   ML models


    I plan to update this dataset with more libraries and later versions as they get upgraded in due course. I hope these wheel files are useful to one and all.
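
    For internet-off kernels, wheels from an attached dataset are typically installed with pip's offline flags; a sketch (the input path is illustrative and depends on how the dataset is attached):

    !pip install --no-index --find-links=/kaggle/input/pythonlibrarieswheelfiles xgboost==2.0.3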

    Recent updates based on user feedback-

    1. lightgbm 4.1.0 and 4.3.0
    2. Older XGBoost versions (2.0.1 and 2.0.2)
    3. Torch-Frame, TabNet, PyTorch-Forecasting, TorchVision
    4. MAPIE
    5. LAMA 0.3.8b1
    6. Iterative-Stratification
    7. Catboost 1.2.3

    Best regards and happy learning and coding!

  19. Kaggle Blog: Winners' Posts

    • kaggle.com
    zip
    Updated Sep 21, 2016
    Cite
    Kaggle (2016). Kaggle Blog: Winners' Posts [Dataset]. https://www.kaggle.com/kaggle/kaggle-blog-winners-posts
    Explore at:
    Available download formats: zip (530977 bytes)
    Dataset updated
    Sep 21, 2016
    Dataset authored and provided by
    Kaggle: http://kaggle.com/
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    In 2010, Kaggle launched its first competition, which was won by Jure Zbontar, who used a simple linear model. Since then a lot has changed. We've seen the rebirth of neural networks, the rise of Python, the creation of powerful libraries like XGBoost, Keras and Tensorflow.

    This dataset is a dump of all winners' posts from the Kaggle blog, starting with Jure Zbontar's. It allows us to track trends in the techniques, tools, and libraries that win competitions.

    This is a simple dump. If there's demand, I can upload more detail (including comments and tags).

  20. Mean Annual Habitat Quality and Its Driving Variables in China (1990–2018)

    • figshare.com
    csv
    Updated May 18, 2025
    Cite
    ChenXi Zhu; Pedro Cabral (2025). Mean Annual Habitat Quality and Its Driving Variables in China (1990–2018) [Dataset]. http://doi.org/10.6084/m9.figshare.29086178.v2
    Explore at:
    Available download formats: csv
    Dataset updated
    May 18, 2025
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    ChenXi Zhu; Pedro Cabral
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    This dataset uses the variables listed in Table 1 to train four machine learning models—Linear Regression, Decision Tree, Random Forest, and Extreme Gradient Boosting—to explain the mean annual habitat quality in China from 1990 to 2018. The best-performing model (XGBoost) achieved an R² of 0.8411, a mean absolute error (MAE) of 0.0862, and a root mean square error (RMSE) of 0.1341. All raster data were resampled to a 0.1º spatial resolution using bilinear interpolation and projected to the WGS 1984 World Mercator coordinate system.

    The dataset includes the following files:

    • A CSV file containing the mean annual values of the dependent variable (habitat quality) and the independent variables across China from 1990 to 2018, based on the data listed in Table 1. (HQ: Habitat Quality; CZ: Climate Zone; FFI: Forest Fragmentation Index; GPP: Gross Primary Productivity; Light: Nighttime Lights; PRE: Mean Annual Precipitation Sum; ASP: Aspect; RAD: Solar Radiation; SLOPE: Slope; TEMP: Mean Annual Temperature; SM: Soil Moisture)
    • A Python script used for modeling habitat quality, including mean encoding of the categorical variable climate zone (CZ), multicollinearity testing using Variance Inflation Factor (VIF), and implementation of four machine learning models to predict habitat quality.

    Table 1. Variables used in the machine learning models

    Dataset                        Units            Source
    Habitat Quality                -                Calculated based on landcover map (Yang and Huang, 2021)
    Gross Primary Productivity     gC m-2 d-1       (Wang et al., 2021)
    Temperature                    ºC               (Peng et al., 2019)
    Precipitation                  0.1 mm           (Peng et al., 2019)
    Downward shortwave radiation   W m−2            (He et al., 2020)
    Soil moisture                  m3 m−3           (K. Zhang et al., 2024)
    Nighttime light                Digital Number   (L. Zhang et al., 2024)
    Forest fragmentation index     -                Derived from landcover map (Yang & Huang, 2021)
    Digital Elevation Model        m                (CGIAR-CSI, 2022)
    Aspect                         Degree           Derived from DEM (CGIAR-CSI, 2022)
    Slope                          Degree           Derived from DEM (CGIAR-CSI, 2022)
    Climate zones                  -                (Kottek et al., 2006)

    References

    CGIAR-CSI. (2022). SRTM DEM dataset in China (2000). In National Tibetan Plateau Data Center. National Tibetan Plateau Data Center. https://dx.doi.org/
    He, J., Yang, K., Tang, W., Lu, H., Qin, J., Chen, Y., & Li, X. (2020). The first high-resolution meteorological forcing dataset for land process studies over China. Scientific Data, 7(1), 25. https://doi.org/10.1038/s41597-020-0369-y
    Kottek, M., Grieser, J., Beck, C., Rudolf, B., & Rubel, F. (2006). World Map of the Köppen-Geiger climate classification updated. Meteorologische Zeitschrift, 15(3), 259–263. https://doi.org/10.1127/0941-2948/2006/0130
    Peng, S., Ding, Y., Liu, W., & Li, Z. (2019). 1 km monthly temperature and precipitation dataset for China from 1901 to 2017. Earth System Science Data, 11(4), 1931–1946. https://doi.org/10.5194/essd-11-1931-2019
    Wang, S., Zhang, Y., Ju, W., Qiu, B., & Zhang, Z. (2021). Tracking the seasonal and inter-annual variations of global gross primary production during last four decades using satellite near-infrared reflectance data. Science of The Total Environment, 755, 142569. https://doi.org/10.1016/j.scitotenv.2020.142569
    Yang, J., & Huang, X. (2021). The 30 m annual land cover dataset and its dynamics in China from 1990 to 2019. Earth System Science Data, 13(8), 3907–3925. https://doi.org/10.5194/essd-13-3907-2021
    Zhang, K., Chen, H., Ma, N., Shang, S., Wang, Y., Xu, Q., & Zhu, G. (2024). A global dataset of terrestrial evapotranspiration and soil moisture dynamics from 1982 to 2020. Scientific Data, 11(1), 445. https://doi.org/10.1038/s41597-024-03271-7
    Zhang, L., Ren, Z., Chen, B., Gong, P., Xu, B., & Fu, H. (2024). A Prolonged Artificial Nighttime-light Dataset of China (1984-2020). Scientific Data, 11(1), 414. https://doi.org/10.1038/s41597-024-03223-1
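
    A minimal sketch of the preprocessing steps named above (mean encoding of CZ and VIF screening), assuming the abbreviated column names from the CSV description and a hypothetical file name; the threshold is a common rule of thumb, not the authors' setting:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_csv("habitat_quality_china.csv")  # hypothetical file name

    # Mean-encode the categorical climate zone by average habitat quality per zone
    df["CZ_encoded"] = df.groupby("CZ")["HQ"].transform("mean")

    # VIF screening of the predictors (VIF > 10 is a common multicollinearity flag)
    predictors = ["FFI", "GPP", "Light", "PRE", "ASP", "RAD", "SLOPE", "TEMP", "SM", "CZ_encoded"]
    X = df[predictors].dropna().assign(const=1.0)  # constant for the auxiliary regressions
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(len(predictors))],
        index=predictors,
    )
    print(vif[vif > 10])  # candidates to drop or combine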
