14 datasets found
  1. Data from: Advanced machine learning techniques for building performance...

    • tandf.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Debaditya Chakraborty; Hazem Elzarka (2023). Advanced machine learning techniques for building performance simulation: a comparative analysis [Dataset]. http://doi.org/10.6084/m9.figshare.6848453.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Debaditya Chakraborty; Hazem Elzarka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Energy consumption predictions for buildings play an important role in energy efficiency and sustainability research. Accurate energy predictions have numerous applications in real-time performance monitoring, fault detection, identifying prime targets for energy conservation, quantifying savings from energy efficiency projects, and more. Machine learning-based energy models have proved to be more efficient and accurate where historical time series data is available. This paper presents various machine learning concepts that will aid in the generation of more accurate and efficient energy models. We show in detail the development of energy models using extreme gradient boosting (XGBoost), artificial neural networks (ANN), and degree-day-based ordinary least squares regression. We present a thorough description of the workflow, including intermediate steps for feature engineering, feature selection, and hyper-parameter optimization, along with the Python source code. Our results indicate that XGBoost produces highly accurate energy models, and the intermediate steps are particularly important for XGBoost and ANN model development.

  2. Fit statistics for scored XGBoost models with 50,000 rows per dataset.

    • plos.figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Fit statistics for scored XGBoost models with 50,000 rows per dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fit statistics for scored XGBoost models with 50,000 rows per dataset.

  3. SenseCobotFusion

    • zenodo.org
    txt, zip
    Updated Jan 25, 2025
    Cite
    Simone Borghi; Alberto Nuzzaci; Margherita Peruzzini; Valeria Villani; Luca Bedogni (2025). SenseCobotFusion [Dataset]. http://doi.org/10.5281/zenodo.14221138
    Explore at:
    Available download formats: txt, zip
    Dataset updated
    Jan 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Simone Borghi; Alberto Nuzzaci; Margherita Peruzzini; Valeria Villani; Luca Bedogni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SenseCobotFusion dataset was created as a natural evolution of the SenseCobot dataset.

    SenseCobotFusion collects metrics extracted from ElectroCardioGram (ECG), Galvanic Skin Response (GSR), ElectroEncephaloGram (EEG), and emotion signals obtained with professional biosensors, processed according to modern state-of-the-art signal processing methods and labeled with a subjective evaluation obtained from the widely used NASA-TLX questionnaire.

    The signals used for this processing were obtained from 21 participants engaged in collaborative robotics programming, organized in three phases: an introduction to learning materials, a baseline measurement task to establish reference conditions, and hands-on practice in tasks of increasing complexity (Task 1 through Task 5).


    SenseCobotFusion is organized to facilitate statistical investigations, data mining, and machine learning applications, and is divided by participant and task performed; a practical Readme.txt file details the metrics extracted, the nature of the source signals, and how to use the dataset.

    Python code in this repository, implemented and optimized with modern state-of-the-art libraries and algorithms, supports researchers in analyzing SenseCobot data (https://zenodo.org/records/10124005), similar datasets, or new related biological signals.

    Classic machine learning models such as Decision Tree, Random Forest, SVM, and XGBoost have been trained on SenseCobotFusion to showcase the dataset's potential and are uploaded to this repository in pickle format.

    Integration with its predecessor, the SenseCobot dataset, allows the user to implement various types of analysis, such as time series and deep learning approaches.

    Building on SenseCobot, the SenseCobotFusion dataset supports HRC research by providing high-quality, multimodal metrics on mental effort and stress during cobot programming. It offers valuable insights for developing intuitive programming interfaces, predictive machine-learning models for real-time stress monitoring, and enhanced human-robot collaboration. It also enables integration with other datasets, statistical investigation of physical and mental states in Industry 5.0, user-specific machine learning model customization, and the creation of adaptive platforms or technologies aligned with the SenseCobotFusion protocol.

    If the SenseCobotFusion_Code code or the SenseCobotFusion dataset is used in whole or in part, please credit the authors and this repository.
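    The trained models above ship as pickle files; the round-trip pattern for using them can be sketched as follows (a small stand-in classifier is trained here, since the real .pkl files and their names are specific to the repository):

```python
# Sketch of the pickle round-trip used to distribute trained models:
# train a small stand-in classifier, serialize it, and reload it.
import pickle
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(X, y)

blob = pickle.dumps(model)     # in the repository this is a .pkl file on disk
restored = pickle.loads(blob)  # use pickle.load(open(path, "rb")) for files
same = (restored.predict(X) == model.predict(X)).all()
```

    Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.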

  4. APOGEE red-giant stars spectroscopic age estimates - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Oct 26, 2024
    Cite
    (2024). APOGEE red-giant stars spectroscopic age estimates - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/f7cbe0a0-fde5-5932-a4ea-042cafe23466
    Explore at:
    Dataset updated
    Oct 26, 2024
    Description

    We tabulate spectroscopic stellar age estimates for 178825 red-giant stars observed by the APOGEE survey (Majewski et al., 2017AJ....154...94M, Cat. III/284) with a median statistical uncertainty of 17%. The ages were obtained with the supervised machine-learning technique XGBoost (Chen & Guestrin, 2016, arXiv:1603.02754), trained on a high-quality dataset of 3060 red-giant and red-clump stars with asteroseismic ages observed by both APOGEE and Kepler (Miglio et al., 2021A&A...645A..85M, Cat. J/A+A/645/A85). Two sets of age estimates are delivered in this table: the first five columns contain the results of the fiducial XGBoost model (obtained with version 1.7.6 of the xgboost Python package) mostly used in the accompanying paper; the final five columns use an XGBoost quantile regression (using version 2.0.0 of the xgboost Python package). Our age estimates constitute a useful database for studying the evolution of the Galactic disc. Cone search capability for table J/A+A/678/A158/catalog (APOGEE spectroscopic age catalogue (table A1))
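    The second set of ages comes from quantile regression. As a self-contained illustration of the idea, the sketch below uses scikit-learn's quantile loss (xgboost 2.0 exposes an analogous quantile objective); fitting the 16th and 84th percentiles brackets roughly a one-sigma uncertainty interval:

```python
# Quantile regression with gradient boosting: two models bracket an interval.
# Sketched with scikit-learn's quantile loss; xgboost >= 2.0 offers the
# analogous objective ("reg:quantileerror" with quantile_alpha).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 1))
y = X[:, 0] + rng.normal(scale=0.5, size=400)  # noise sigma = 0.5

lo = GradientBoostingRegressor(loss="quantile", alpha=0.16, random_state=0).fit(X, y)
hi = GradientBoostingRegressor(loss="quantile", alpha=0.84, random_state=0).fit(X, y)

# Average interval width; for Gaussian noise it should be near 2 * sigma.
width = np.mean(hi.predict(X) - lo.predict(X))
```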

  5. Data from: Representative sample size for estimating saturated hydraulic...

    • search.dataone.org
    • beta.hydroshare.org
    • +1more
    Updated May 25, 2024
    Cite
    Amin Ahmadisharaf; Reza Nematirad; Sadra Sabouri; Yakov Pachepsky; Behzad Ghanbarian (2024). Representative sample size for estimating saturated hydraulic conductivity via machine learning [Dataset]. https://search.dataone.org/view/sha256%3A9b40514c6e7aad0079724cc95c1486385d2dfaa7f02ced190cda693925261b53
    Explore at:
    Dataset updated
    May 25, 2024
    Dataset provided by
    Hydroshare
    Authors
    Amin Ahmadisharaf; Reza Nematirad; Sadra Sabouri; Yakov Pachepsky; Behzad Ghanbarian
    Description

    This database includes saturated hydraulic conductivity data from the USKSAT database, as well as the associated Python code used to analyze learning curves and to train and test the developed machine learning models.
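    A learning-curve analysis of the kind described, i.e., scoring a model at increasing training-set sizes to judge a representative sample size, can be sketched with scikit-learn (the model and sizes here are illustrative, not the study's):

```python
# Learning-curve sketch: validation score vs. training-set size.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 4))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.05, size=300)

sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y, train_sizes=[0.2, 0.5, 1.0], cv=3,
)
mean_val = val_scores.mean(axis=1)  # mean validation R^2 per sample size
```

    The point at which `mean_val` plateaus indicates a representative sample size for the model.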

  6. PythonLibraries|WheelFiles

    • kaggle.com
    Updated Mar 25, 2024
    Cite
    Ravi Ramakrishnan (2024). PythonLibraries|WheelFiles [Dataset]. https://www.kaggle.com/datasets/ravi20076/pythonlibrarieswheelfiles/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ravi Ramakrishnan
    License

    https://cdla.io/sharing-1-0/

    Description

    Hello all,
    This dataset is my humble attempt to allow myself and others to upgrade essential Python packages to their latest versions. It contains the .whl files of the packages below, for use across general kernels and especially in internet-off code challenges:

    Package                  | Version(s)          | Functionality
    AutoGluon                | 1.0.0               | AutoML models
    Catboost                 | 1.2.2, 1.2.3        | ML models
    Iterative-Stratification | 0.1.7               | Iterative stratification for multi-label classifiers
    Joblib                   | 1.3.2               | File dumping and retrieval
    LAMA                     | 0.3.8b1             | AutoML models
    LightGBM                 | 4.3.0, 4.2.0, 4.1.0 | ML models
    MAPIE                    | 0.8.2               | Quantile regression
    Numpy                    | 1.26.3              | Data wrangling
    Pandas                   | 2.1.4               | Data wrangling
    Polars                   | 0.20.3, 0.20.4      | Data wrangling
    PyTorch                  | 2.0.1               | Neural networks
    PyTorch-TabNet           | 4.1.0               | Neural networks
    PyTorch-Forecast         | 0.7.0               | Neural networks
    Pygwalker                | 0.3.20              | Data wrangling and visualization
    Scikit-learn             | 1.3.2, 1.4.0        | ML models/pipelines/data wrangling
    Scipy                    | 1.11.4              | Data wrangling/statistics
    TabPFN                   | 10.1.9              | ML models
    Torch-Frame              | 1.7.5               | Neural networks
    TorchVision              | 0.15.2              | Neural networks
    XGBoost                  | 2.0.2, 2.0.1, 2.0.3 | ML models


    I plan to update this dataset with more libraries and later versions as they get upgraded in due course. I hope these wheel files are useful to one and all.

    Recent updates based on user feedback:

    1. lightgbm 4.1.0 and 4.3.0
    2. Older XGBoost versions (2.0.1 and 2.0.2)
    3. Torch-Frame, TabNet, PyTorch-Forecasting, TorchVision
    4. MAPIE
    5. LAMA 0.3.8b1
    6. Iterative-Stratification
    7. Catboost 1.2.3

    Best regards and happy learning and coding!

  7. Software Defects Dataset 1k

    • kaggle.com
    Updated Jun 16, 2025
    Cite
    Ravikumar R N (2025). Software Defects Dataset 1k [Dataset]. https://www.kaggle.com/datasets/ravikumarrn/software-defects-dataset-1k/versions/1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ravikumar R N
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📩 Software Defects Multilingual Dataset with AST & Token Features

    This repository provides a dataset of 1,000 synthetic code functions across multiple programming languages for the purpose of software defect prediction, multilingual static analysis, and LLM evaluation.

    🙋 Citation

    If you use this dataset in your research or project, please cite it as:

    "Ravikumar R N, Software Defects Multilingual Dataset with AST Features (2025). Generated by synthetic methods for defect prediction and multilingual code analysis."

    🧠 Dataset Highlights

    • Languages Included: Python, Java, JavaScript, C, C++, Go, Rust
    • Records: 1,000 code snippets
    • Labels: defect (1 = buggy, 0 = clean)
    • Features:

      • token_count: Total tokens (AST-based for Python)
      • num_ifs, num_returns, num_func_calls: Code structure features
      • ast_nodes: Number of nodes in the abstract syntax tree (Python only)
      • lines_of_code & cyclomatic_complexity: Simulated metrics for modeling

    📊 Column Descriptions

    Column                | Description
    function_name         | Unique identifier for the function
    code                  | The actual function source code
    language              | Programming language used
    lines_of_code         | Approximate number of lines in the function
    cyclomatic_complexity | Simulated measure of decision complexity
    defect                | 1 = buggy, 0 = clean
    token_count           | Total token count (Python uses AST tokens)
    num_ifs               | Count of 'if' statements
    num_returns           | Count of 'return' statements
    num_func_calls        | Number of function calls
    ast_nodes             | AST node count (Python only, fallback = token count)

    đŸ› ïž Usage Examples

    This dataset is suitable for:

    • Training traditional ML models like Random Forests or XGBoost
    • Evaluating prompt-based or fine-tuned LLMs (e.g., CodeT5, GPT-4)
    • Feature importance studies using AST and static code metrics
    • Cross-lingual transfer learning in code understanding
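    Structural features like `num_ifs`, `num_returns`, `num_func_calls`, and `ast_nodes` can be reproduced for Python snippets with the standard-library ast module; `extract_features` below is a hypothetical helper, not part of the dataset:

```python
# Sketch: count AST-based structural features from a Python function,
# mirroring the dataset's num_ifs / num_returns / num_func_calls / ast_nodes.
import ast

def extract_features(source: str) -> dict:
    """Hypothetical helper: structural feature counts for one snippet."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    return {
        "ast_nodes": len(nodes),
        "num_ifs": sum(isinstance(n, ast.If) for n in nodes),
        "num_returns": sum(isinstance(n, ast.Return) for n in nodes),
        "num_func_calls": sum(isinstance(n, ast.Call) for n in nodes),
    }

feats = extract_features(
    "def f(x):\n    if x > 0:\n        return g(x)\n    return 0\n"
)
```

    For the non-Python languages in the dataset, a language-specific parser would be needed; the dataset falls back to token counts there.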

    📎 License

    This dataset is synthetic and licensed under CC BY 4.0. Feel free to use, share, or adapt it with proper attribution.

  8. Dataset for 'Stream Temperature Predictions for River Basin Management in...

    • knb.ecoinformatics.org
    • search.dataone.org
    • +2more
    Updated Aug 8, 2023
    Cite
    Helen Weierbach; Aranildo R. Lima; Jared D. Willard; Valerie C. Hendrix; Danielle S. Christianson; Misha Lubich; Charuleka Varadharajan (2023). Dataset for 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water 2022 [Dataset]. http://doi.org/10.15485/1854257
    Explore at:
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Helen Weierbach; Aranildo R. Lima; Jared D. Willard; Valerie C. Hendrix; Danielle S. Christianson; Misha Lubich; Charuleka Varadharajan
    Time period covered
    Jan 1, 1980 - Jun 30, 2021
    Area covered
    Description

    This data package presents forcing data, model code, and model output for classical machine learning models that predict monthly stream water temperature, as presented in the manuscript 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water (Weierbach et al., 2022). Specifically, for input forcing datasets we include two files, each generated using the BASIN-3D data integration tool (Varadharajan et al., 2022), for stations in the Pacific Northwest and Mid-Atlantic hydrologic regions. Model code (written in Python with the use of Jupyter notebooks) includes code for data preprocessing; for training Multiple Linear Regression, Support Vector Regression, and Extreme Gradient Boosted Tree models; and additional notebooks for analysis of model output. We include specific model output files, in HDF5 format, representing the modeling configurations presented in the manuscript. Together, these data make up the workflow for predictions across three scenarios (single station, regional, and predictions in unmonitored basins) presented in the manuscript and allow for reproducibility of the modeling procedures.

  9. Replication Data for: A framework to decompose process noise in fineblanking...

    • dataverse.harvard.edu
    Updated Mar 12, 2025
    Cite
    Martin Unterberg (2025). Replication Data for: A framework to decompose process noise in fineblanking [Dataset]. http://doi.org/10.7910/DVN/A1B09Y
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Martin Unterberg
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset contains features extracted with the Python library catch22, using sliding windows (window size 50, stride 10), from force profiles acquired from 39,941 fineblanking shearing phases during a continuous experiment (i.e., one complete tool lifecycle without disassembly of the tool during or between machine runs). The raw data was preprocessed with drift and tilt correction before feature extraction. The features are provided for:

    • The full shearing path (fullsignal)
    • The section between 1.5 mm and 4.5 mm of the shearing path (croppedsignal)

    Furthermore, tearing data was visually evaluated every 200th process cycle and interpolated. Additionally, SHAP values from a feature importance analysis of XGBoost regression models that regressed the features onto the tearing data are contained within the dataset for both 'fullsignal' and 'croppedsignal'. All files are .npy arrays with the shape (n_samples, n_features).
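    Since all files are .npy arrays of shape (n_samples, n_features), they load directly with NumPy; the file name and array contents below are illustrative, not the package's:

```python
# Sketch of saving/loading a (n_samples, n_features) .npy array as shipped
# in this package; the file name is illustrative.
import os
import tempfile
import numpy as np

# Stand-in feature matrix, e.g. 100 windows x 22 catch22 features.
features = np.random.default_rng(3).normal(size=(100, 22))

path = os.path.join(tempfile.mkdtemp(), "croppedsignal_features.npy")
np.save(path, features)

loaded = np.load(path)  # in practice: np.load("fullsignal_features.npy"), etc.
n_samples, n_features = loaded.shape
```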

  10. Machine learning predicted AGNs in HSC-Wide region - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Sep 4, 2020
    Cite
    (2020). Machine learning predicted AGNs in HSC-Wide region - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/59947f8c-edfa-58b5-a044-312c396f1794
    Explore at:
    Dataset updated
    Sep 4, 2020
    Description

    We investigate the performance of machine-learning techniques in classifying active galactic nuclei (AGNs), including X-ray-selected AGNs (XAGNs), infrared-selected AGNs (IRAGNs), and radio-selected AGNs (RAGNs). Using the known physical parameters in the Cosmic Evolution Survey (COSMOS) field, we are able to create quality training samples in the region of the Hyper Suprime-Cam (HSC) survey. We compare several Python packages (e.g., scikit-learn, Keras, and XGBoost) and use XGBoost to identify AGNs and report the performance (e.g., accuracy, precision, recall, F1 score, and AUROC). Our results indicate that the performance is high for bright XAGN and IRAGN host galaxies. The combination of the HSC (optical) information with the Wide-field Infrared Survey Explorer band 1 and band 2 (near-infrared) information performs well in identifying AGN hosts. For both type 1 (broad-line) XAGNs and type 1 (unobscured) IRAGNs, the performance is very good when using optical-to-infrared information. These results can apply to the five-band data from the wide regions of the HSC survey and future all-sky surveys. Cone search capability for table J/ApJ/920/68/table7 (AGN candidates in HSC-Wide region for 112609 objects)
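    The performance measures listed (accuracy, precision, recall, F1 score, AUROC) can all be computed with scikit-learn; the toy labels and scores below are illustrative, not the paper's results:

```python
# Sketch: the five classification metrics named in the description,
# computed on toy predictions with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]        # 1 = AGN, 0 = non-AGN (toy labels)
y_pred  = [1, 0, 1, 0, 0, 0, 1, 1]        # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6]  # predicted probabilities

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    "auroc":     roc_auc_score(y_true, y_score),  # uses scores, not labels
}
```

    Note that AUROC is computed from the continuous scores while the other four use the thresholded predictions.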

  11. Data from: SPUSPO: Spatially Partitioned Unsupervised Segmentation Parameter...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Jan 24, 2020
    Cite
    Stefanos Georganos; Tais Grippa; Moritz Lennert; Brian Alan Johnson; Sabine Vanhuysse; Eléonore Wolff (2020). SPUSPO: Spatially Partitioned Unsupervised Segmentation Parameter Optimization for Efficiently Segmenting Large Heterogeneous Areas [Dataset]. http://doi.org/10.5281/zenodo.1341116
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stefanos Georganos; Tais Grippa; Moritz Lennert; Brian Alan Johnson; Sabine Vanhuysse; Eléonore Wolff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data, results, and processing material from the application of GEOBIA-based, Spatially Partitioned Unsupervised Segmentation Parameter Optimization (SPUSPO) in the city of Ouagadougou. In detail, it contains:

    • A Land Use - Land Cover map of Ouagadougou derived through SPUSPO. The classifier used was Extreme Gradient Boosting (XGBoost).

      Labels:

      0: Building
      1: Swimming Pool
      2: Artificial Ground Surface
      3: Bare Ground
      4: Tree
      5: Low Vegetation
      6: Inland Water
      7: Shadow

    • The training and test data used in the study (SPUSPO and benchmark approach).

    The data are given in CSV format.

    • The Jupyter notebook code, which involves Python and GRASS GIS, to automate and efficiently perform SPUSPO on a large dataset.

    Python code calling GRASS GIS functions to automate the procedure.

    • The segmentation layers coming from SPUSPO and the benchmark approaches (in raster formats due to data limitations).

    Segmentation rasters for each approach.

    • The R code for optimization of XGBoost as well as feature selection with VSURF and classification of the whole dataset.

    • Segmentation evaluation metrics.

    A CSV file with the data used to compute the Area Fit Index for each approach.

    • Morphological zones of Ouagadougou as created by Grippa et al. 2017, in shp format.

  12. Data from: Simulated wildfire burned area over the CONUS during 2001-2020

    • osti.gov
    Updated Jul 30, 2024
    + more versions
    Cite
    Huang, Huilin; Liu, Ye (2024). Simulated wildfire burned area over the CONUS during 2001-2020 [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/2424127
    Explore at:
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    DOE
    Pacific Northwest National Laboratory
    Authors
    Huang, Huilin; Liu, Ye
    Description

    Wildfires have shown increasing trends in both frequency and severity across the Contiguous United States (CONUS). However, process-based fire models have difficulties in accurately simulating the burned area over the CONUS because they simplify the physical process and cannot capture the interplay among fire, ignition, climate, and human activities. The deficiency of burned area simulation degrades the description of fire impacts on energy balance, water budget, and carbon fluxes in Earth System Models (ESMs). Alternatively, machine learning (ML) based fire models, which capture statistical relationships between the burned area and environmental factors, have shown promising burned area predictions and corresponding fire impact simulation. We develop a hybrid framework (ML4Fire-XGB) that integrates a pretrained eXtreme Gradient Boosting (XGBoost) wildfire model with the Energy Exascale Earth System Model (E3SM) land model (ELM). A Fortran-C-Python deep learning bridge is adapted to support online communication between ELM and the ML fire model. Specifically, the burned area predicted by the ML-based wildfire model is directly passed to ELM to adjust the carbon pool and vegetation dynamics after disturbance, which are then used as predictors in the ML-based fire model in the next time step. Evaluated against the historical burned area from Global Fire Emissions Database 5 from 2001-2020, the ML4Fire-XGB model outperforms process-based fire models in terms of spatial distribution and seasonal variations. Sensitivity analysis confirms that ML4Fire-XGB captures the responses of the burned area to rising temperatures well. The ML4Fire-XGB model has proved to be a new tool for studying vegetation-fire interactions and, more importantly, enables seamless exploration of climate-fire feedback, working as an active component in E3SM.

  13. Spatiotemporal Soil Erosion Dataset for the Yarlung Tsangpo River Basin...

    • figshare.com
    zip
    Updated May 19, 2025
    Cite
    Peng Xin (2025). Spatiotemporal Soil Erosion Dataset for the Yarlung Tsangpo River Basin (1990–2100) [Dataset]. http://doi.org/10.6084/m9.figshare.29095763.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 19, 2025
    Dataset provided by
    figshare
    Authors
    Peng Xin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Yarlung Zangbo River
    Description

    This dataset includes historical erosion data (1990–2019), future soil erosion projections under the SSP126, SSP245, and SSP585 scenarios (2021–2100), and predicted R and C factors for each period.

    Future R factors

    We incorporated 25 Global Climate Models (GCMs) from CMIP6 for calculating the future R factors, selected via the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) project (Table S3). The selection was based on the completeness of their time series and their alignment with the selected scenarios. Rainfall projections were corrected using quantile delta mapping (QDM) (Cannon et al., 2015) to address systematic biases in intensity distributions while preserving the projected trends in mean rainfall and extremes, which are critical for soil erosion analysis (Eekhout and de Vente, 2019). Bias correction was conducted using a 25-year baseline (1990–2014), with adjustments made monthly to correct for seasonal biases. The corrected bias functions were then applied to adjust the daily rainfall data for 2020–2100 using the "ibicus" package, an open-source Python tool for bias adjustment and climate model evaluation. A minimum daily rainfall threshold of 0.1 mm was used to define rainy days, following established studies (Bulovic, 2024; Eekhout and de Vente, 2019; Switanek et al., 2017). Additionally, the study employed QDM to correct biases in historical GCM simulations, ensuring the applicability of the QDM method for rainfall bias correction in the YTRB. A baseline period of 1990–2010 was selected to establish the bias correction function, which was subsequently applied to adjust GCM simulations for 2011–2014. To evaluate the effectiveness of this calibration, we compared the annual mean precipitation from bias-corrected GCMs during 2011–2014 with observed precipitation data at the pixel level (Figs. S2, S3), using RÂČ as the evaluation metric. The results showed a significant increase in RÂČ after bias correction, confirming the effectiveness of the QDM approach.

    Future C factors

    To ensure the accuracy of the C factor predictions, we selected five CMIP6 climate models (Table S4) with high spatial resolution compared to other CMIP6 climate models. Of the five selected models, CanESM5, IPSL-CM6-LR, and MIROC-ES2L have high equilibrium climate sensitivity (ECS) values. The ECS is the expected long-term warming after a doubling of atmospheric CO2 concentrations, and it is one of the most important indicators for understanding the impact of future warming (Rao et al., 2023). We therefore selected climate models with ECS values >3.0 to capture the full range of potential climate-induced changes affecting soil erosion. After selecting the climate models, we constructed an XGBoost model using historical C factor data and bioclimatic variables from the WorldClim data portal. WorldClim provides global gridded datasets at a 1 kmÂČ spatial resolution, including 19 bioclimatic variables derived from monthly temperature and precipitation data, reflecting annual trends, seasonality, and extreme environmental conditions (Hijmans et al., 2005). However, strong collinearity among the 19 bioclimatic variables and an excessive number of input features may increase model complexity and reduce XGBoost's predictive accuracy. To optimize performance, we employed Recursive Feature Elimination (RFE), an iterative method for selecting the most relevant features while preserving prediction accuracy (Kornyo et al., 2023; Xiong et al., 2024). In each iteration, the current subset of features was used to train an XGBoost model, and feature importance was evaluated to remove the least significant variable, gradually refining the feature set. Using 80% of the data for training and 20% for testing, we employed 5-fold cross-validation to determine the feature subset that maximized the average RÂČ, ensuring optimal model performance. Additionally, a Genetic Algorithm (GA) was applied in each iteration to optimize the hyperparameters of the XGBoost model, which is crucial for enhancing both the efficiency and robustness of the model (Zhong and Liu, 2024; Zou et al., 2024). Finally, based on the variable selection results from RFE, the bioclimatic variables of the future climate models were input into the trained XGBoost model to obtain the average C factor of the five selected climate models across four future periods (2020–2040, 2040–2060, 2060–2080, and 2080–2100).

    RUSLE model

    In this study, the mean annual soil loss was estimated using the RUSLE model, which enables estimation of the spatial pattern of soil erosion (Renard et al., 1991). In areas where data are scarce, we consider RUSLE an effective runoff-dependent soil erosion model because it requires only limited data for the study area (Haile et al., 2012).
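    The RFE loop described (train, rank feature importances, drop the weakest) can be sketched with scikit-learn's RFE; here a random forest stands in for the GA-tuned XGBoost model, and the synthetic columns stand in for the bioclimatic variables:

```python
# Recursive feature elimination sketch: drop the least important feature
# one at a time until the requested number remains. A random forest stands
# in for the study's tuned XGBoost model; data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 8))   # 8 stand-in "bioclimatic" variables
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

rfe = RFE(
    RandomForestRegressor(n_estimators=100, random_state=0),
    n_features_to_select=2,
    step=1,  # remove one feature per iteration, as in the description
)
rfe.fit(X, y)
selected = np.flatnonzero(rfe.support_)  # indices of the retained features
```

    In the study, each RFE iteration additionally re-tunes hyperparameters with a genetic algorithm and scores subsets by 5-fold cross-validated RÂČ, which this sketch omits.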

  14. Mean Annual Habitat Quality and Its Driving Variables in China (1990–2018)

    • figshare.com
    csv
    Updated May 18, 2025
    Cite
    ChenXi Zhu; Pedro Cabral (2025). Mean Annual Habitat Quality and Its Driving Variables in China (1990–2018) [Dataset]. http://doi.org/10.6084/m9.figshare.29086178.v2
    Explore at:
    Available download formats: csv
    Dataset updated
    May 18, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    ChenXi Zhu; Pedro Cabral
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    This dataset uses the variables listed in Table 1 to train four machine learning models (Linear Regression, Decision Tree, Random Forest, and Extreme Gradient Boosting) to explain the mean annual habitat quality in China from 1990 to 2018. The best-performing model (XGBoost) achieved an R² of 0.8411, a mean absolute error (MAE) of 0.0862, and a root mean square error (RMSE) of 0.1341. All raster data were resampled to a 0.1° spatial resolution using bilinear interpolation and projected to the WGS 1984 World Mercator coordinate system.

    The dataset includes the following files:

    - A CSV file containing the mean annual values of the dependent variable (habitat quality) and the independent variables across China from 1990 to 2018, based on the data listed in Table 1. (HQ: Habitat Quality; CZ: Climate Zone; FFI: Forest Fragmentation Index; GPP: Gross Primary Productivity; Light: Nighttime Lights; PRE: Mean Annual Precipitation Sum; ASP: Aspect; RAD: Solar Radiation; SLOPE: Slope; TEMP: Mean Annual Temperature; SM: Soil Moisture)
    - A Python script used for modeling habitat quality, including mean encoding of the categorical variable climate zone (CZ), multicollinearity testing using the Variance Inflation Factor (VIF), and implementation of the four machine learning models to predict habitat quality.

    Table 1. Variables used in the machine learning models

    Dataset                       | Units          | Source
    Habitat Quality               | -              | Calculated from landcover map (Yang & Huang, 2021)
    Gross Primary Productivity    | gC m⁻² d⁻č     | Wang et al. (2021)
    Temperature                   | °C             | Peng et al. (2019)
    Precipitation                 | 0.1 mm         | Peng et al. (2019)
    Downward shortwave radiation  | W m⁻²          | He et al. (2020)
    Soil moisture                 | m³ m⁻³         | K. Zhang et al. (2024)
    Nighttime light               | Digital Number | L. Zhang et al. (2024)
    Forest fragmentation index    | -              | Derived from landcover map (Yang & Huang, 2021)
    Digital Elevation Model       | m              | CGIAR-CSI (2022)
    Aspect                        | Degree         | Derived from DEM (CGIAR-CSI, 2022)
    Slope                         | Degree         | Derived from DEM (CGIAR-CSI, 2022)
    Climate zones                 | -              | Kottek et al. (2006)

    References

    CGIAR-CSI. (2022). SRTM DEM dataset in China (2000). National Tibetan Plateau Data Center. https://dx.doi.org/
    He, J., Yang, K., Tang, W., Lu, H., Qin, J., Chen, Y., & Li, X. (2020). The first high-resolution meteorological forcing dataset for land process studies over China. Scientific Data, 7(1), 25. https://doi.org/10.1038/s41597-020-0369-y
    Kottek, M., Grieser, J., Beck, C., Rudolf, B., & Rubel, F. (2006). World Map of the Köppen-Geiger climate classification updated. Meteorologische Zeitschrift, 15(3), 259–263. https://doi.org/10.1127/0941-2948/2006/0130
    Peng, S., Ding, Y., Liu, W., & Li, Z. (2019). 1 km monthly temperature and precipitation dataset for China from 1901 to 2017. Earth System Science Data, 11(4), 1931–1946. https://doi.org/10.5194/essd-11-1931-2019
    Wang, S., Zhang, Y., Ju, W., Qiu, B., & Zhang, Z. (2021). Tracking the seasonal and inter-annual variations of global gross primary production during last four decades using satellite near-infrared reflectance data. Science of The Total Environment, 755, 142569. https://doi.org/10.1016/j.scitotenv.2020.142569
    Yang, J., & Huang, X. (2021). The 30 m annual land cover dataset and its dynamics in China from 1990 to 2019. Earth System Science Data, 13(8), 3907–3925. https://doi.org/10.5194/essd-13-3907-2021
    Zhang, K., Chen, H., Ma, N., Shang, S., Wang, Y., Xu, Q., & Zhu, G. (2024). A global dataset of terrestrial evapotranspiration and soil moisture dynamics from 1982 to 2020. Scientific Data, 11(1), 445. https://doi.org/10.1038/s41597-024-03271-7
    Zhang, L., Ren, Z., Chen, B., Gong, P., Xu, B., & Fu, H. (2024). A Prolonged Artificial Nighttime-light Dataset of China (1984–2020). Scientific Data, 11(1), 414. https://doi.org/10.1038/s41597-024-03223-1
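    The two preprocessing steps named for the Python script — mean encoding of the categorical climate-zone variable and VIF-based multicollinearity screening — could be sketched roughly as below. This is a minimal NumPy illustration, not the dataset's actual script; the toy category labels, the `mean_encode`/`vif` helper names, and the example values are all assumptions.

    ```python
    # Sketch (assumed, not the dataset's script): mean encoding of a
    # categorical variable, plus a from-scratch Variance Inflation Factor.
    import numpy as np

    def mean_encode(categories, target):
        """Replace each category label with the mean of the target within it."""
        cats = np.asarray(categories)
        y = np.asarray(target, dtype=float)
        means = {c: y[cats == c].mean() for c in np.unique(cats)}
        return np.array([means[c] for c in cats])

    def vif(X):
        """VIF per column: 1 / (1 - R^2), where each column is regressed
        on all other columns (with an intercept)."""
        X = np.asarray(X, dtype=float)
        n, k = X.shape
        out = []
        for j in range(k):
            y = X[:, j]
            A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ coef
            r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
            out.append(1.0 / max(1.0 - r2, 1e-12))
        return np.array(out)

    # Toy example: encode an assumed CZ column against HQ, then screen
    # two deliberately collinear predictors.
    cz = ["Dwa", "Cfa", "Dwa", "Cfa", "Bwk", "Bwk"]
    hq = [0.6, 0.8, 0.5, 0.9, 0.3, 0.2]
    cz_enc = mean_encode(cz, hq)
    X = np.column_stack([cz_enc, np.array(hq) * 2.0 + 0.1])
    print(vif(X))  # large values (> 10) flag multicollinearity
    ```

    In practice the same screening is usually done with `statsmodels`' `variance_inflation_factor`; the hand-rolled version above just makes the definition explicit.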
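    The scores reported for the best model (R² = 0.8411, MAE = 0.0862, RMSE = 0.1341) follow the standard definitions of those metrics. A minimal sketch of how they are computed, using toy arrays rather than the actual habitat-quality predictions:

    ```python
    # Standard definitions of the three reported accuracy metrics,
    # sketched with NumPy. y_true / y_pred below are illustrative toy
    # values, not the dataset's actual habitat-quality predictions.
    import numpy as np

    def r2_score(y, yhat):
        y, yhat = np.asarray(y, float), np.asarray(yhat, float)
        ss_res = ((y - yhat) ** 2).sum()        # residual sum of squares
        ss_tot = ((y - y.mean()) ** 2).sum()    # total sum of squares
        return 1.0 - ss_res / ss_tot

    def mae(y, yhat):
        y, yhat = np.asarray(y, float), np.asarray(yhat, float)
        return np.abs(y - yhat).mean()

    def rmse(y, yhat):
        y, yhat = np.asarray(y, float), np.asarray(yhat, float)
        return np.sqrt(((y - yhat) ** 2).mean())

    y_true = np.array([0.2, 0.4, 0.6, 0.8])
    y_pred = np.array([0.25, 0.35, 0.65, 0.75])
    print(r2_score(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))
    # ≈ 0.95, 0.05, 0.05
    ```

    Equivalent functions (`r2_score`, `mean_absolute_error`, `root_mean_squared_error`) ship with scikit-learn, which is the usual choice when the four models themselves are fitted with that library.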
