Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Energy consumption predictions for buildings play an important role in energy efficiency and sustainability research. Accurate energy predictions have numerous applications in real-time performance monitoring, fault detection, identifying prime targets for energy conservation, quantifying savings resulting from energy efficiency projects, etc. Machine learning-based energy models have proved to be more efficient and accurate where historical time series data are available. This paper presents various machine learning concepts that will aid in the generation of more accurate and efficient energy models. We show in detail the development of energy models using extreme gradient boosting (XGBoost), artificial neural networks (ANN), and degree-day-based ordinary least-squares regression. We present a thorough description of the workflow, including intermediate steps for feature engineering, feature selection, hyperparameter optimization, and the Python source code. Our results indicate that XGBoost produces highly accurate energy models, and that the intermediate steps are particularly important for XGBoost and ANN model development.
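The paper ships its own source code; purely as a hedged illustration of the workflow steps named above (feature engineering plus hyperparameter optimization), a search over XGBoost settings might look like the sketch below, where the file name and feature columns are invented:

```python
# Minimal sketch (not the paper's code): tuning an XGBoost energy model.
# Assumes a hypothetical CSV with engineered weather/calendar features.
import pandas as pd
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

df = pd.read_csv("hourly_energy.csv")                      # hypothetical file
X = df[["outdoor_temp", "humidity", "hour", "weekday"]]    # hypothetical features
y = df["consumption_kwh"]                                  # hypothetical target

param_grid = {"max_depth": [3, 5, 7],
              "learning_rate": [0.05, 0.1],
              "n_estimators": [200, 500]}
# Time-ordered splits respect the time series nature of the data.
search = GridSearchCV(XGBRegressor(objective="reg:squarederror"),
                      param_grid, cv=TimeSeriesSplit(n_splits=5),
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```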
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fit statistics for scored XGBoost models with 50,000 rows per dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SenseCobotFusion dataset was created as a natural evolution of the SenseCobot dataset.
SenseCobotFusion collects metrics extracted from ElectroCardioGram (ECG), Galvanic Skin Response (GSR), ElectroEncephaloGram (EEG), and emotion signals acquired with professional biosensors, processed according to modern state-of-the-art signal processing methods and labeled with subjective evaluations from the widely used NASA-TLX questionnaire.
The signals were obtained from 21 participants engaged in collaborative robotics programming, organized in three phases: an introduction to learning materials, a baseline measurement to establish reference conditions, and hands-on practice structured as tasks of increasing complexity (Task 1 through Task 5).
SenseCobotFusion is organized by participant and task to facilitate statistical investigations, data mining, and machine learning applications; a practical Readme.txt file details the metrics extracted, the nature of the source signals, and how to use the dataset.
The Python code in this repository has been implemented and optimized with modern state-of-the-art libraries and algorithms to support researchers in analyzing SenseCobot data (https://zenodo.org/records/10124005), similar datasets, or new related biological signals.
Classic machine learning models such as Decision Tree, Random Forest, SVM, and XGBoost have been trained on SenseCobotFusion to showcase the dataset's potential and are uploaded to this repository in pickle format.
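The repository's Readme documents the actual file names; as a hedged sketch only, a pickled model could be loaded and applied as follows (the file names and feature layout here are assumptions):

```python
# Sketch only: loading one of the pickled models shipped with the dataset.
# File names and expected feature columns are assumptions, not documentation.
import pickle
import pandas as pd

with open("random_forest_nasa_tlx.pkl", "rb") as f:   # hypothetical file name
    model = pickle.load(f)

features = pd.read_csv("participant_01_task_1.csv")   # hypothetical metrics file
print(model.predict(features))
```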
Integration with its predecessor, the SenseCobot dataset, allows the user to implement further types of analysis, such as time series and deep learning approaches.
Building on the SenseCobot dataset, SenseCobotFusion supports human-robot collaboration (HRC) research by providing high-quality, multimodal metrics on mental effort and stress during cobot programming. These offer valuable insights for developing intuitive programming interfaces, predictive machine learning models for real-time stress monitoring, and enhanced human-robot collaboration. The dataset also enables integration with other datasets, statistical investigation of physical and mental states in Industry 5.0, user-specific machine learning model customization, and the creation of adaptive platforms or technologies aligned with the SenseCobotFusion protocol.
If the SenseCobotFusion_Code code or the SenseCobotFusion dataset is used in whole or in part, please credit the authors and this repository.
We tabulate spectroscopic stellar age estimates for 178825 red-giant stars observed by the APOGEE survey (Majewski et al., 2017AJ....154...94M, Cat. III/284) with a median statistical uncertainty of 17%. The ages were obtained with the supervised machine-learning technique XGBoost (Chen & Guestrin, 2016, arXiv:1603.02754), trained on a high-quality dataset of 3060 red-giant and red-clump stars with asteroseismic ages observed by both APOGEE and Kepler (Miglio et al., 2021A&A...645A..85M, Cat. J/A+A/645/A85). Two sets of age estimates are delivered in this table: the first five columns contain the results of the fiducial XGBoost model (obtained with version 1.7.6 of the xgboost Python package) mostly used in the accompanying paper; the final five columns use an XGBoost quantile regression (using version 2.0.0 of the xgboost Python package). Our age estimates constitute a useful database for studying the evolution of the Galactic disc. Cone search capability for table J/A+A/678/A158/catalog (APOGEE spectroscopic age catalogue (table A1))
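For readers unfamiliar with the second technique: quantile regression is supported natively in xgboost >= 2.0. The following is a minimal sketch on synthetic data, not the authors' training pipeline:

```python
# Minimal sketch of XGBoost quantile regression (xgboost >= 2.0);
# synthetic data stands in for the APOGEE/Kepler training sample.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                     # stand-in stellar parameters
y = X[:, 0] + rng.normal(scale=0.5, size=1000)     # stand-in asteroseismic age

# One model per quantile: 16th/50th/84th percentiles give a 1-sigma interval.
models = {q: XGBRegressor(objective="reg:quantileerror", quantile_alpha=q).fit(X, y)
          for q in (0.16, 0.5, 0.84)}
lo, med, hi = (models[q].predict(X[:3]) for q in (0.16, 0.5, 0.84))
print(med, hi - lo)
```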
This database includes saturated hydraulic conductivity data from the USKSAT database, as well as the associated Python codes used to analyze learning curves and to train and test the developed machine learning models.
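The released codes are the authority here; as a hedged sketch of the kind of learning-curve analysis described, scikit-learn's learning_curve utility could be used as below, with invented file and column names:

```python
# Sketch (file and column names are assumptions) of a learning-curve analysis
# for saturated hydraulic conductivity (Ksat) prediction.
import numpy as np
import pandas as pd
from sklearn.model_selection import learning_curve
from xgboost import XGBRegressor

df = pd.read_csv("usksat.csv")                                  # hypothetical file
X, y = df[["sand", "silt", "clay", "bulk_density"]], np.log10(df["ksat"])

sizes, train_scores, val_scores = learning_curve(
    XGBRegressor(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5), scoring="r2")
print(sizes, val_scores.mean(axis=1))   # validation R2 vs. training set size
```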
Community Data License Agreement, Sharing 1.0 (CDLA-Sharing-1.0): https://cdla.io/sharing-1-0/
Hello all,
This dataset is my humble attempt to allow myself and others to upgrade essential Python packages to their latest versions. It contains the .whl files of the packages below, to be used across general kernels and especially in internet-off code challenges (an install sketch follows the table):
| Package | Version | Functionality |
|---|---|---|
| AutoGluon | 1.0.0 | AutoML models |
| Catboost | 1.2.2, 1.2.3 | ML models |
| Iterative-Stratification | 0.1.7 | Iterative stratification for multi-label classifiers |
| Joblib | 1.3.2 | File dumping and retrieval |
| LAMA | 0.3.8b1 | AutoML models |
| LightGBM | 4.3.0, 4.2.0, 4.1.0 | ML models |
| MAPIE | 0.8.2 | Quantile regression |
| Numpy | 1.26.3 | Data wrangling |
| Pandas | 2.1.4 | Data wrangling |
| Polars | 0.20.3, 0.20.4 | Data wrangling |
| PyTorch | 2.0.1 | Neural networks |
| PyTorch-TabNet | 4.1.0 | Neural networks |
| PyTorch-Forecast | 0.7.0 | Neural networks |
| Pygwalker | 0.3.20 | Data wrangling and visualization |
| Scikit-learn | 1.3.2, 1.4.0 | ML models / pipelines / data wrangling |
| Scipy | 1.11.4 | Data wrangling / statistics |
| TabPFN | 10.1.9 | ML models |
| Torch-Frame | 1.7.5 | Neural networks |
| TorchVision | 0.15.2 | Neural networks |
| XGBoost | 2.0.2, 2.0.1, 2.0.3 | ML models |
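For reference, in an internet-off kernel these wheels can be installed straight from the attached dataset using pip's offline flags; the dataset mount path below is a placeholder, not the actual path:

```python
# In a Kaggle notebook cell: install a package offline from the attached wheels.
# The dataset path is a placeholder; substitute the real mount point.
!pip install xgboost --no-index --find-links /kaggle/input/python-wheel-files/
```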
I plan to update this dataset with more libraries and later versions as they get upgraded in due course. I hope these wheel files are useful to one and all.
Best regards and happy learning and coding!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📦 Software Defects Multilingual Dataset with AST & Token Features
This repository provides a dataset of 1,000 synthetic code functions across multiple programming languages for the purpose of software defect prediction, multilingual static analysis, and LLM evaluation.
📌 Citation
If you use this dataset in your research or project, please cite it as:
"Ravikumar R N, Software Defects Multilingual Dataset with AST Features (2025). Generated by synthetic methods for defect prediction and multilingual code analysis."
🧠 Dataset Highlights
Target variable: defect (1 = buggy, 0 = clean)
Features:
- token_count: Total tokens (AST-based for Python)
- num_ifs, num_returns, num_func_calls: Code structure features
- ast_nodes: Number of nodes in the abstract syntax tree (Python only)
- lines_of_code & cyclomatic_complexity: Simulated metrics for modeling

📊 Columns Description
| Column | Description |
|---|---|
| function_name | Unique identifier for the function |
| code | The actual function source code |
| language | Programming language used |
| lines_of_code | Approximate number of lines in the function |
| cyclomatic_complexity | Simulated measure of decision complexity |
| defect | 1 = buggy, 0 = clean |
| token_count | Total token count (Python uses AST tokens) |
| num_ifs | Count of 'if' statements |
| num_returns | Count of 'return' statements |
| num_func_calls | Number of function calls |
| ast_nodes | AST node count (Python only, fallback = token count) |
🛠️ Usage Examples
This dataset is suitable for:
- Software defect prediction
- Multilingual static analysis
- LLM evaluation

📄 License
This dataset is synthetic and licensed under CC BY 4.0. Feel free to use, share, or adapt it with proper attribution.
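As a hedged usage sketch (the CSV file name is an assumption), the numeric columns above can feed a simple baseline defect classifier:

```python
# Sketch: baseline defect classifier on the numeric columns described above.
# The CSV file name is an assumption.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

df = pd.read_csv("software_defects_multilingual.csv")   # hypothetical name
features = ["token_count", "num_ifs", "num_returns", "num_func_calls",
            "ast_nodes", "lines_of_code", "cyclomatic_complexity"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["defect"],
    test_size=0.2, random_state=42, stratify=df["defect"])
clf = XGBClassifier().fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```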
This data package presents forcing data, model code, and model output for classical machine learning models that predict monthly stream water temperature, as presented in the manuscript "Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning", Water (Weierbach et al., 2022). Specifically, for input forcing datasets we include two files, each generated using the BASIN-3D data integration tool (Varadharajan et al., 2022) for stations in the Pacific Northwest and Mid-Atlantic hydrologic regions. Model code (written in Python using Jupyter notebooks) includes code for data preprocessing, for training Multiple Linear Regression, Support Vector Regression, and Extreme Gradient Boosted Tree models, and additional notebooks for analysis of model output. We include specific model output files, representing the modeling configurations presented in the manuscript, in HDF5 format. Together, these data make up the workflow for predictions across three scenarios (single station, regional, and predictions in unmonitored basins) presented in the manuscript and allow for reproducibility of the modeling procedures.
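As a minimal sketch of one of the model types named above (not the released notebooks; the station file and columns are invented), a Support Vector Regression baseline could look like this:

```python
# Sketch only (placeholder file/columns): SVR baseline for monthly stream temperature.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

df = pd.read_csv("pnw_stations_monthly.csv")     # hypothetical BASIN-3D export
X = df[["air_temp", "precip", "month"]]          # hypothetical forcing columns
y = df["stream_temp"]

# SVR is scale-sensitive, so standardize features inside a pipeline.
model = make_pipeline(StandardScaler(), SVR(C=10.0, epsilon=0.1)).fit(X, y)
print(model.predict(X.head()))
```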
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset Description
This dataset contains features extracted with the Python library catch22 using sliding windows (window size 50, stride 10) from force profiles acquired from 39,941 fineblanking shearing phases during a continuous experiment (i.e., one complete tool lifecycle without disassembly of the tool during or between machine runs). The raw data were preprocessed with drift and tilt correction before feature extraction. The features are provided for:
- the full shearing path (fullsignal)
- the section between 1.5 mm and 4.5 mm of the shearing path (croppedsignal)
Furthermore, tearing data was visually evaluated every 200th process cycle and interpolated. Additionally, SHAP values from a feature importance analysis of XGBoost regression models that regressed the features onto the tearing data are included for both 'fullsignal' and 'croppedsignal'. All files are .npy arrays with shape (n_samples, n_features).
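A hedged sketch of the windowing scheme described (window size 50, stride 10), using the pycatch22 package with a synthetic stand-in for a force profile:

```python
# Sketch of catch22 feature extraction with window size 50 and stride 10,
# as described above; the force profile here is synthetic stand-in data.
import numpy as np
import pycatch22

force = np.random.default_rng(0).normal(size=500)   # stand-in force profile

# Slide a window of 50 samples with a stride of 10 and extract 22 features each.
windows = [force[s:s + 50] for s in range(0, len(force) - 50 + 1, 10)]
features = np.array([pycatch22.catch22_all(w.tolist())["values"] for w in windows])
print(features.shape)   # (n_windows, 22)
```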
We investigate the performance of machine-learning techniques in classifying active galactic nuclei (AGNs), including X-ray-selected AGNs (XAGNs), infrared-selected AGNs (IRAGNs), and radio-selected AGNs (RAGNs). Using the known physical parameters in the Cosmic Evolution Survey (COSMOS) field, we are able to create quality training samples in the region of the Hyper Suprime-Cam (HSC) survey. We compare several Python packages (e.g., scikit-learn, Keras, and XGBoost) and use XGBoost to identify AGNs and report the performance (e.g., accuracy, precision, recall, F1 score, and AUROC). Our results indicate that the performance is high for bright XAGN and IRAGN host galaxies. The combination of the HSC (optical) information with the Wide-field Infrared Survey Explorer band 1 and band 2 (near-infrared) information performs well to identify AGN hosts. For both type 1 (broad-line) XAGNs and type 1 (unobscured) IRAGNs, the performance is very good using optical-to-infrared information. These results can apply to the five-band data from the wide regions of the HSC survey and future all-sky surveys. Cone search capability for table J/ApJ/920/68/table7 (AGN candidates in HSC-Wide region for 112609 objects)
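For reference, the performance measures quoted are standard scikit-learn metrics; a minimal sketch on synthetic data (not the COSMOS training sample):

```python
# Minimal sketch: the metrics quoted above, computed with scikit-learn
# on synthetic data standing in for the AGN training sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
clf = XGBClassifier().fit(X_tr, y_tr)
pred, proba = clf.predict(X_te), clf.predict_proba(X_te)[:, 1]
print(accuracy_score(y_te, pred), precision_score(y_te, pred),
      recall_score(y_te, pred), f1_score(y_te, pred), roc_auc_score(y_te, proba))
```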
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data, results, and processing material from the application of GEOBIA-based, Spatially Partitioned Segmentation Parameter Optimization (SPUSPO) in the city of Ouagadougou. In detail, it contains:
Labels:
0: Building
1: Swimming Pool
2: Artificial Ground Surface
3: Bare Ground
4: Tree
5: Low Vegetation
6: Inland Water
7: Shadow
The data are given in CSV format.
Python code calling GRASS GIS functions to automate the procedure (an illustrative call is sketched after this list).
Segmentation rasters for each approach.
A CSV file with the data used to compute the Area Fit Index for each approach.
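The repository's own script is authoritative; purely as a hedged illustration of calling GRASS GIS segmentation from Python (map names and the threshold value are invented, and the snippet must run inside a GRASS session):

```python
# Illustrative only: calling GRASS GIS i.segment from Python, in the spirit of
# the automation script in this dataset. Map names and threshold are invented,
# and a GRASS session must already be active.
import grass.script as gs

gs.run_command("i.segment",
               group="ouaga_rgb",        # hypothetical imagery group
               output="segments_t005",   # hypothetical output raster
               threshold=0.05,           # segmentation parameter being optimized
               overwrite=True)
stats = gs.parse_command("r.univar", map="segments_t005", flags="g")
print(stats["n"])                        # number of non-null cells
```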
Wildfires have shown increasing trends in both frequency and severity across the Contiguous United States (CONUS). However, process-based fire models have difficulties in accurately simulating the burned area over the CONUS because they simplify the physical process and cannot capture the interplay among fire, ignition, climate, and human activities. The deficiency of burned area simulation degrades the description of fire impact on energy balance, water budget, and carbon fluxes in Earth System Models (ESMs). Alternatively, machine learning (ML) based fire models, which capture statistical relationships between the burned area and environmental factors, have shown promising burned area predictions and corresponding fire impact simulation. We develop a hybrid framework (ML4Fire-XGB) that integrates a pretrained eXtreme Gradient Boosting (XGBoost) wildfire model with the Energy Exascale Earth System Model (E3SM) land model (ELM). A Fortran-C-Python deep learning bridge is adapted to support online communication between ELM and the ML fire model. Specifically, the burned area predicted by the ML-based wildfire model is directly passed to ELM to adjust the carbon pool and vegetation dynamics after disturbance, which are then used as predictors in the ML-based fire model in the next time step. Evaluated against the historical burned area from the Global Fire Emissions Database 5 from 2001-2020, the ML4Fire-XGB model outperforms process-based fire models in terms of spatial distribution and seasonal variations. Sensitivity analysis confirms that ML4Fire-XGB captures the responses of the burned area to rising temperatures well. The ML4Fire-XGB model has proved to be a new tool for studying vegetation-fire interactions and, more importantly, enables seamless exploration of climate-fire feedback, working as an active component in E3SM.
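As a toy illustration only (none of this is E3SM/ELM code, and all names are invented), the coupling loop described above can be sketched as:

```python
# Toy illustration of the online coupling loop described above: the ML model
# predicts burned area, the land model ingests it and returns updated
# predictors for the next time step. All names and values are invented.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X_hist = rng.random((500, 3))           # stand-in climate/vegetation predictors
y_hist = X_hist[:, 0] * 0.1             # stand-in burned-area fraction
fire_model = XGBRegressor().fit(X_hist, y_hist)   # "pretrained" ML fire model

state = rng.random(3)                   # current land-model state (invented)
for step in range(12):                  # one year of monthly coupling
    burned_frac = float(fire_model.predict(state.reshape(1, -1))[0])
    # A real land model would adjust carbon pools / vegetation here; we fake it:
    state = state * (1.0 - burned_frac) + rng.normal(scale=0.01, size=3)
    print(f"month {step + 1}: burned fraction {burned_frac:.4f}")
```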
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes historical erosion data (1990-2019), future soil erosion projections under SSP126, SSP245, and SSP585 scenarios (2021-2100), and predicted R and C factors for each period.

Future R factors
We incorporated 25 Global Climate Models (GCMs) from CMIP6 for calculating the future R factors, selected via the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) project (Table S3). The selection was based on the completeness of their time series and their alignment with the selected scenarios. Rainfall projections were corrected using quantile delta mapping (QDM) (Cannon et al., 2015) to address systematic biases in intensity distributions while preserving the projected trends in mean rainfall and extremes, which is critical for soil erosion analysis (Eekhout and de Vente, 2019). Bias correction was conducted using a 25-year baseline (1990-2014), with adjustments made monthly to correct for seasonal biases. The corrected bias functions were then applied to adjust the daily rainfall data for 2020-2100 using the "ibicus" package, an open-source Python tool for bias adjustment and climate model evaluation. A minimum daily rainfall threshold of 0.1 mm was used to define rainy days, following established studies (Bulovic, 2024; Eekhout and de Vente, 2019; Switanek et al., 2017). Additionally, the study employed QDM to correct biases in historical GCM simulations, ensuring the applicability of the QDM method for rainfall bias correction in the YTRB. A baseline period of 1990-2010 was selected to establish the bias correction function, which was subsequently applied to adjust GCM simulations for 2011-2014. To evaluate the effectiveness of this calibration, we compared the annual mean precipitation from bias-corrected GCMs during 2011-2014 with observed precipitation data at the pixel level (Figs. S2, S3), using R² as the evaluation metric. The results showed a significant increase in R² after bias correction, confirming the effectiveness of the QDM approach.

Future C factors
To ensure the accuracy of the C factor predictions, we selected five CMIP6 climate models (Table S4) with high spatial resolution compared to other CMIP6 climate models. Of the five selected climate models, CanESM5, IPSL-CM6-LR, and MIROC-ES2L have high equilibrium climate sensitivity (ECS) values. The ECS is the expected long-term warming after a doubling of atmospheric CO2 concentrations and is one of the most important indicators for understanding the impact of future warming (Rao et al., 2023). Therefore, we selected these five climate models with ECS values >3.0 to capture the full range of potential climate-induced changes affecting soil erosion. After selecting the climate models, we constructed an XGBoost model using historical C factor data and bioclimatic variables from the WorldClim data portal. WorldClim provides global gridded datasets with a 1 km² spatial resolution, including 19 bioclimatic variables derived from monthly temperature and precipitation data, reflecting annual trends, seasonality, and extreme environmental conditions (Hijmans et al., 2005). However, strong collinearity among the 19 bioclimatic variables and an excessive number of input features may increase model complexity and reduce XGBoost's predictive accuracy. To optimize performance, we employed Recursive Feature Elimination (RFE), an iterative method for selecting the most relevant features while preserving prediction accuracy (Kornyo et al., 2023; Xiong et al., 2024). In each iteration, the current subset of features was used to train an XGBoost model, and feature importance was evaluated to remove the least significant variable, gradually refining the feature set. Using 80% of the data for training and 20% for testing, we employed 5-fold cross-validation to determine the feature subset that maximized the average R², ensuring optimal model performance. Additionally, a Genetic Algorithm (GA) was applied in each iteration to optimize the hyperparameters of the XGBoost model, which is crucial for enhancing both the efficiency and robustness of the model (Zhong and Liu, 2024; Zou et al., 2024). Finally, based on the variable selection results from RFE, the bioclimatic variables of the future climate models were input into the trained XGBoost model to obtain the average C factor for the five selected climate models across four future periods (2020-2040, 2040-2060, 2060-2080, and 2080-2100).

RUSLE model
In this study, the mean annual soil loss was initially estimated using the RUSLE model, which enables us to estimate the spatial pattern of soil erosion (Renard et al., 1991). In areas where data are scarce, we consider RUSLE to be an effective runoff-dependent soil erosion model because it requires only limited data for the study area (Haile et al., 2012).
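As a simplified sketch of the feature-selection loop described above (scikit-learn's RFECV stands in for the custom RFE with genetic-algorithm hyperparameter tuning, which is omitted; the file and column names are placeholders):

```python
# Simplified sketch of RFE over bioclimatic predictors with an XGBoost model;
# the paper's pipeline additionally tunes hyperparameters with a genetic
# algorithm, omitted here. File and column names are placeholders.
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

df = pd.read_csv("c_factor_training.csv")             # hypothetical file
X = df[[f"bio{i}" for i in range(1, 20)]]             # 19 bioclimatic variables
y = df["c_factor"]

# Drop one feature per iteration, keeping the subset that maximizes CV R2.
selector = RFECV(XGBRegressor(), step=1,
                 cv=KFold(5, shuffle=True, random_state=0), scoring="r2")
selector.fit(X, y)
print(list(X.columns[selector.support_]))
```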
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset uses the variables listed in Table 1 to train four machine learning models (Linear Regression, Decision Tree, Random Forest, and Extreme Gradient Boosting) to explain the mean annual habitat quality in China from 1990 to 2018. The best-performing model (XGBoost) achieved an R² of 0.8411, a mean absolute error (MAE) of 0.0862, and a root mean square error (RMSE) of 0.1341. All raster data were resampled to a 0.1° spatial resolution using bilinear interpolation and projected to the WGS 1984 World Mercator coordinate system.

The dataset includes the following files:
- A CSV file containing the mean annual values of the dependent variable (habitat quality) and the independent variables across China from 1990 to 2018, based on the data listed in Table 1. (HQ: Habitat Quality; CZ: Climate Zone; FFI: Forest Fragmentation Index; GPP: Gross Primary Productivity; Light: Nighttime Lights; PRE: Mean Annual Precipitation Sum; ASP: Aspect; RAD: Solar Radiation; SLOPE: Slope; TEMP: Mean Annual Temperature; SM: Soil Moisture)
- A Python script used for modeling habitat quality, including mean encoding of the categorical variable climate zone (CZ), multicollinearity testing using the Variance Inflation Factor (VIF), and implementation of four machine learning models to predict habitat quality.

Table 1. Variables used in the machine learning models

| Dataset | Units | Source |
|---|---|---|
| Habitat Quality | - | Calculated based on landcover map (Yang & Huang, 2021) |
| Gross Primary Productivity | gC m⁻² d⁻¹ | Wang et al., 2021 |
| Temperature | °C | Peng et al., 2019 |
| Precipitation | 0.1 mm | Peng et al., 2019 |
| Downward shortwave radiation | W m⁻² | He et al., 2020 |
| Soil moisture | m³ m⁻³ | K. Zhang et al., 2024 |
| Nighttime light | Digital Number | L. Zhang et al., 2024 |
| Forest fragmentation index | - | Derived from landcover map (Yang & Huang, 2021) |
| Digital Elevation Model | m | CGIAR-CSI, 2022 |
| Aspect | Degree | Derived from DEM (CGIAR-CSI, 2022) |
| Slope | Degree | Derived from DEM (CGIAR-CSI, 2022) |
| Climate zones | - | Kottek et al., 2006 |

References
CGIAR-CSI. (2022). SRTM DEM dataset in China (2000). National Tibetan Plateau Data Center. https://dx.doi.org/
He, J., Yang, K., Tang, W., Lu, H., Qin, J., Chen, Y., & Li, X. (2020). The first high-resolution meteorological forcing dataset for land process studies over China. Scientific Data, 7(1), 25. https://doi.org/10.1038/s41597-020-0369-y
Kottek, M., Grieser, J., Beck, C., Rudolf, B., & Rubel, F. (2006). World Map of the Köppen-Geiger climate classification updated. Meteorologische Zeitschrift, 15(3), 259-263. https://doi.org/10.1127/0941-2948/2006/0130
Peng, S., Ding, Y., Liu, W., & Li, Z. (2019). 1 km monthly temperature and precipitation dataset for China from 1901 to 2017. Earth System Science Data, 11(4), 1931-1946. https://doi.org/10.5194/essd-11-1931-2019
Wang, S., Zhang, Y., Ju, W., Qiu, B., & Zhang, Z. (2021). Tracking the seasonal and inter-annual variations of global gross primary production during last four decades using satellite near-infrared reflectance data. Science of The Total Environment, 755, 142569. https://doi.org/10.1016/j.scitotenv.2020.142569
Yang, J., & Huang, X. (2021). The 30 m annual land cover dataset and its dynamics in China from 1990 to 2019. Earth System Science Data, 13(8), 3907-3925. https://doi.org/10.5194/essd-13-3907-2021
Zhang, K., Chen, H., Ma, N., Shang, S., Wang, Y., Xu, Q., & Zhu, G. (2024). A global dataset of terrestrial evapotranspiration and soil moisture dynamics from 1982 to 2020. Scientific Data, 11(1), 445. https://doi.org/10.1038/s41597-024-03271-7
Zhang, L., Ren, Z., Chen, B., Gong, P., Xu, B., & Fu, H. (2024). A Prolonged Artificial Nighttime-light Dataset of China (1984-2020). Scientific Data, 11(1), 414. https://doi.org/10.1038/s41597-024-03223-1
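As a hedged sketch of the multicollinearity test mentioned above (the CSV name is an assumption; column abbreviations follow the description):

```python
# Sketch of the VIF multicollinearity test described above; the CSV name is
# an assumption, and column abbreviations follow the dataset description.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("habitat_quality_china.csv")   # hypothetical file
X = df[["GPP", "Light", "PRE", "RAD", "SLOPE", "TEMP", "SM", "FFI", "ASP"]]
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns)
print(vif.sort_values(ascending=False))         # high VIF -> collinear feature
```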