24 datasets found
  1. Fit statistics for scored XGBoost models with 50,000 rows per dataset

    • figshare.com
    • plos.figshare.com
    xls
    Updated Oct 20, 2023
    Cite
    John Prindle; Himal Suthar; Emily Putnam-Hornstein (2023). Fit statistics for scored XGBoost models with 50,000 rows per dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0291581.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    John Prindle; Himal Suthar; Emily Putnam-Hornstein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fit statistics for scored XGBoost models with 50,000 rows per dataset.

  2. UAE Used Cars Analysis - Full Project v1.0

    • kaggle.com
    zip
    Updated Mar 10, 2025
    Cite
    mohamed saad 254 (2025). UAE Used Cars Analysis - Full Project v1.0 [Dataset]. https://www.kaggle.com/datasets/mohamedsaad254/uae-used-cars-analysis-full-project-v1-0/code
    Explore at:
    Available download formats: zip (17351496 bytes)
    Dataset updated
    Mar 10, 2025
    Authors
    mohamed saad 254
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    United Arab Emirates
    Description

    UAE Used Cars Analysis - Full Project v1.0

    A Complete End-to-End Solution for Analyzing and Predicting Used Car Prices in the UAE

    Overview

    This dataset provides a compressed ZIP archive of the "UAE Used Cars Analysis" project, featuring 10,000 used car listings with precise location data (covering cities like Dubai, Abu Dhabi, and Sharjah), source code for a Dash-based web application, and a trained XGBoost model. It is designed for data scientists, analysts, and automotive enthusiasts to explore regional market trends, predict car prices, and visualize geospatial insights.

    Contents

    • data/uae_used_cars_10k.csv: Dataset with 10,000 records, including a Location column (e.g., Dubai, Abu Dhabi, Sharjah).
    • models/:
      • stacking_model.pkl: Trained XGBoost model.
      • scaler.pkl: Preprocessing scaler.
      • models.py: Model-related functions.
    • app.py: Main Dash application file.
    • callbacks.py: Interactive callbacks for the dashboard.
    • layouts.py: UI layout definitions.
    • train_model.py: Model training script.
    • utils.py: Utility functions.
    • requirements.txt: Required Python libraries.
    • README.md: Project documentation.

    Usage Instructions

    1. Download: Extract the ZIP using WinRAR or 7-Zip.
    2. Setup: Install dependencies: pip install -r requirements.txt.
    3. Run: Execute python app.py and access the app at http://127.0.0.1:8050/.
    4. Explore: Analyze data by location, predict prices, or train models locally.

    Dataset Details

    The dataset (uae_used_cars_10k.csv) includes:

    • Make: Car brand (e.g., Toyota).
    • Model: Car model (e.g., Camry).
    • Year: Manufacturing year.
    • Mileage: Distance driven in miles.
    • Cylinders: Number of engine cylinders.
    • Price: Sale price in AED.
    • Transmission: Automatic or Manual.
    • Fuel Type: Petrol, Diesel, etc.
    • Color: Exterior color.
    • Description: Seller's description.
    • Location: City of sale (e.g., Dubai, Abu Dhabi, Sharjah).

    Notes

    • Numeric columns (Mileage, Cylinders) may contain missing values; imputation recommended.
    • Mileage is in miles; convert to kilometers if needed (Mileage_Km = Mileage * 1.60934).
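
    A minimal pandas sketch covering both notes, assuming the column names above (median imputation is one reasonable choice, not the project's prescribed method):

    import pandas as pd

    df = pd.read_csv("data/uae_used_cars_10k.csv")

    # Impute missing numeric values with the column median
    for col in ["Mileage", "Cylinders"]:
        df[col] = df[col].fillna(df[col].median())

    # Convert mileage from miles to kilometers
    df["Mileage_Km"] = df["Mileage"] * 1.60934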

    Prerequisites

    • Python 3.8+
    • Libraries in requirements.txt (e.g., Dash, XGBoost, dash-leaflet for maps).

    Source

    Data aggregated from UAE car platforms in March 2025.

    Related Resources

    Last Updated: March 9, 2025 | Version 1.0 | Author: Mohammed Saad

  3. XGBoost (2.0.3) whl

    • kaggle.com
    zip
    Updated Jan 4, 2024
    Cite
    Carl McBride Ellis (2024). XGBoost (2.0.3) whl [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/xgboost-2-0-0-whl/code
    Explore at:
    Available download formats: zip (1187991597 bytes)
    Dataset updated
    Jan 4, 2024
    Authors
    Carl McBride Ellis
    Description

    This is the whl file for XGBoost version 2.0.0 (released 12th September 2023)
    Update: This is the whl file for XGBoost version 2.0.2 (released 13th November 2023)
    Update: This is the whl file for XGBoost version 2.0.3 (released 19th December 2023)

    Installation: attach this dataset to one's notebook, then:

    !pip install -q /kaggle/input/xgboost-2-0-0-whl/xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl
    
    import xgboost as xgb
    
    # Confirm the installed version
    xgb.__version__
    

    License: Apache Software License (Apache-2.0)

  4. Performance results for the XGBoost calibrated model using clinical variables and chief complaint against the reference model (triage priority) and respective hyperparameters

    • datasetcatalog.nlm.nih.gov
    Updated Apr 2, 2020
    Cite
    Palos, Carlos; Fernandes, Marta; Leite, Francisca; Celi, Leo Anthony; Horng, Steven; Vieira, Susana M.; Johnson, Alistair; Mendes, Rúben; Finkelstein, Stan (2020). Performance results for the XGBoost calibrated model using clinical variables and chief complaint against the reference model (triage priority) and respective hyperparameters. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000500932
    Explore at:
    Dataset updated
    Apr 2, 2020
    Authors
    Palos, Carlos; Fernandes, Marta; Leite, Francisca; Celi, Leo Anthony; Horng, Steven; Vieira, Susana M.; Johnson, Alistair; Mendes, Rúben; Finkelstein, Stan
    Description

    AUROC was the performance measure used for hyperparameter tuning and best-model selection on the training set. Hyperparameters not mentioned in the table were left at their defaults in XGBClassifier (Python 3.7).

  5. Python script for training machine learning models for glass density prediction

    • agh.rodbuk.pl
    Updated Jul 29, 2025
    Cite
    Paweł Stoch; Paweł Stoch (2025). Python script for training machine learning models for glass density prediction [Dataset]. http://doi.org/10.58032/AGH/WY0GEJ
    Explore at:
    Available download formats: application/x-ipynb+json (123454), txt (1056), zip (2747711), application/x-ipynb+json (130989), application/x-ipynb+json (2224507), application/x-ipynb+json (1194193)
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    AGH University of Krakow
    Authors
    Paweł Stoch; Paweł Stoch
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Accurately predicting glass density is crucial for designing novel materials. This study aims to develop a robust predictive model for the density of oxide glasses and, more importantly, to investigate how physically-informed feature engineering can create accurate and interpretable models that reveal underlying physical principles. Using a dataset of 76,593 oxide glasses from the SciGlass database, three ML models (ElasticNet, XGBoost, MLP) were trained and evaluated. Four distinct feature sets were constructed with increasing physical complexity, ranging from simple elemental composition to the advanced Magpie descriptors. The best model was further analyzed for interpretability using feature importance and SHAP analysis. A clear hierarchical improvement in predictive accuracy was observed with increasing feature sophistication across all models. The XGBoost model combined with the Magpie feature set provided the best performance, achieving a coefficient of determination (R²) of 0.97. Interpretability analysis revealed that the model's predictions were overwhelmingly driven by physical attributes, with mean atomic weight being the most influential predictor. The model learns to approximate the fundamental density equation using mean atomic weight as a proxy for molar mass and electronic structure features to estimate molar volume. This demonstrates that a data-driven approach can function as a scientifically valid and interpretable tool, accelerating the discovery of new materials.
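
    A minimal sketch of the modeling-and-interpretation pipeline described above, with stand-in data in place of the SciGlass/Magpie features (names and hyperparameters are illustrative, not the authors' script):

    import shap
    import xgboost as xgb
    from sklearn.datasets import make_regression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # Stand-in for 76,593 SciGlass compositions featurized with Magpie descriptors
    X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, y_train)
    print("R2:", r2_score(y_test, model.predict(X_test)))

    # Interpretability: tree SHAP values and a global importance summary
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test)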

  6. Data from: Representative sample size for estimating saturated hydraulic conductivity via machine learning

    • search.dataone.org
    • hydroshare.org
    Updated May 25, 2024
    Cite
    Amin Ahmadisharaf; Reza Nematirad; Sadra Sabouri; Yakov Pachepsky; Behzad Ghanbarian (2024). Representative sample size for estimating saturated hydraulic conductivity via machine learning [Dataset]. https://search.dataone.org/view/sha256%3A1a7d2a59141f58fa9b927ab55cd6ad737474b2eb4419a6c568223c903760d00e
    Explore at:
    Dataset updated
    May 25, 2024
    Dataset provided by
    Hydroshare
    Authors
    Amin Ahmadisharaf; Reza Nematirad; Sadra Sabouri; Yakov Pachepsky; Behzad Ghanbarian
    Description

    This database includes saturated hydraulic conductivity data from the USKSAT database, as well as the associated Python code used to analyze learning curves and to train and test the developed machine learning models.

  7. Dataset for 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water 2022

    • data.nceas.ucsb.edu
    • search.dataone.org
    Updated Aug 8, 2023
    Cite
    Helen Weierbach; Aranildo R. Lima; Jared D. Willard; Valerie C. Hendrix; Danielle S. Christianson; Misha Lubich; Charuleka Varadharajan (2023). Dataset for 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water 2022 [Dataset]. http://doi.org/10.15485/1854257
    Explore at:
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Helen Weierbach; Aranildo R. Lima; Jared D. Willard; Valerie C. Hendrix; Danielle S. Christianson; Misha Lubich; Charuleka Varadharajan
    Time period covered
    Jan 1, 1980 - Jun 30, 2021
    Area covered
    Description

    This data package presents forcing data, model code, and model output for classical machine learning models that predict monthly stream water temperature, as presented in the manuscript ‘Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning’, Water (Weierbach et al., 2022). For input forcing datasets, we include two files each generated using the BASIN-3D data integration tool (Varadharajan et al., 2022) for stations in the Pacific Northwest and Mid-Atlantic hydrologic regions. Model code (written in Python using Jupyter notebooks) includes code for data preprocessing; training Multiple Linear Regression, Support Vector Regression, and Extreme Gradient Boosted Tree models; and additional notebooks for analysis of model output. We also include, in HDF5 format, the specific model output files that represent the modeling configurations presented in the manuscript. Together, these data make up the workflow for predictions across three scenarios (single station, regional, and predictions in unmonitored basins) presented in the manuscript and allow for reproducibility of the modeling procedures.

  8. Friction coefficient data of open-cell AlSi10Mg and AlSi10Mg-Al2O3 materials with different pore sizes by pin-on-disk test and machine learning prediction

    • data.mendeley.com
    Updated May 30, 2023
    Cite
    Mihail Kolev (2023). Friction coefficient data of open-cell AlSi10Mg and AlSi10Mg-Al2O3 materials with different pore sizes by pin-on-disk test and machine learning prediction [Dataset]. http://doi.org/10.17632/2356m76ktj.1
    Explore at:
    Dataset updated
    May 30, 2023
    Authors
    Mihail Kolev
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data folders and files are organized in the repository as follows:

    The Pin-on-disk_data folder contains three subfolders (Raw_files; Processed_files; COF_calculation):
    • Raw_files: 12 DWF files with the raw data for each specimen.
    • Processed_files: 12 XLSX files with the processed data for each specimen.
    • COF_calculation: eight XLSX files with the average COF and time data for each material, two PNG files with the plots of the average COF vs time for each material pair, and one Python script for calculating and visualizing the average COF and time data.

    The Prediction folder contains three subfolders (Input_files; Output_data; Python_COF_prediction):
    • Input_files: four XLSX files with the input data for the Python script to make the predictions of COF vs sliding time for each material.
    • Output_data: eight XLSX files with the actual and predicted values of COF for two different sets (test and validation) of each material, four TXT files with the performance metrics of the predicted COF for each material, and four PNG files with the plots of the actual vs predicted COF as a function of time for each material.
    • Python_COF_prediction: one Python script for making and evaluating the predictions of COF vs sliding time using an XGBoost model.

    The data were collected by performing dry wear tests at room temperature with a linear velocity of 0.5 m∙s−1, a load of 50 N, and a sliding time of 420 s. The specimen labels indicate the following:

    • AC_3_2, AC_3_3, AC_3_4 are three datasets used for pin-on-disk tests conducted with open-cell AlSi10Mg-Al2O3 composite with pore size of 800 ÷ 1000 μm (AC);
    • C_5_1, C_5_2, C_5_3 are three datasets used for pin-on-disk tests conducted with open-cell AlSi10Mg material with pore size of 800 ÷ 1000 μm (C);
    • AE_3_2, AE_4_1, AE_6_6 are three datasets used for pin-on-disk tests conducted with open-cell AlSi10Mg-Al2O3 composite with pore size of 1000 ÷ 1200 μm (AE);
    • E_3_1, E_6, E_6_3 are three datasets used for pin-on-disk tests conducted with open-cell AlSi10Mg material with pore size of 1000 ÷ 1200 μm (E).

  9. Fraudulent Financial Transaction Prediction

    • kaggle.com
    zip
    Updated Feb 15, 2025
    Cite
    Younus_Mohamed (2025). Fraudulent Financial Transaction Prediction [Dataset]. https://www.kaggle.com/datasets/younusmohamed/fraudulent-financial-transaction-prediction
    Explore at:
    Available download formats: zip (41695207 bytes)
    Dataset updated
    Feb 15, 2025
    Authors
    Younus_Mohamed
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Fraud Detection with Imbalanced Data

    Overview
    This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.

    Key Points
    - Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraud.
    - Multiple Feature Files: Combine them by matching on id or Group.
    - Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
    - Goal: Predict which transactions in test_share.csv might be fraudulent.

    Files in this Dataset

    1. train.csv

      • Rows: 227,845 (example size)
      • Columns: 28
      • Description: Contains historical transaction data for training a fraud detection model.
      • Important: The Target column (0 = Clean, 1 = Fraud).
    2. test_share.csv

      • Rows: 56,962 (example size)
      • Columns: 27
      • Description: Test dataset, with the same structure as train.csv but without the Target column.
    3. Geo_scores.csv

      • Columns: (id, geo_score)
      • Description: Location-based geospatial scores for each transaction.
    4. Lambda_wts.csv

      • Columns: (Group, lambda_wt)
      • Description: Proprietary “lambda” weights associated with each Group.
    5. Qset_tats.csv

      • Columns: (id, qsets_normalized_tat)
      • Description: Network turn-around times (TAT) for each transaction.
    6. instance_scores.csv

      • Columns: (id, instance_scores)
      • Description: Vulnerability or risk qualification scores for each transaction.

    Suggested Usage

    1. Load all CSVs into dataframes.
    2. Merge additional files (Geo_scores.csv, Lambda_wts.csv, etc.) by matching id or Group.
    3. Explore the severe class imbalance in train.csv (Target ~1% is fraud).
    4. Train any suitable classification model (Random Forest, XGBoost, etc.) on train.csv.
    5. Predict on test_share.csv or your own external data.

    Possible Tools:
    - Python: pandas, NumPy, scikit-learn
    - Imbalance Handling: SMOTE, Random Oversampler, or class weights
    - Metrics: Precision, Recall, F1-score, ROC-AUC, etc.

    Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!
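
    A minimal merge-and-train sketch for the suggested usage above, using the file and column names listed (the class-weight choice is one simple imbalance strategy, not a prescription):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    train = pd.read_csv("train.csv")
    geo = pd.read_csv("Geo_scores.csv")
    lam = pd.read_csv("Lambda_wts.csv")
    tat = pd.read_csv("Qset_tats.csv")
    inst = pd.read_csv("instance_scores.csv")

    # Merge auxiliary features: id-keyed files first, then Group-keyed lambda weights
    df = (train.merge(geo, on="id", how="left")
               .merge(tat, on="id", how="left")
               .merge(inst, on="id", how="left")
               .merge(lam, on="Group", how="left"))

    X = df.drop(columns=["id", "Target"])  # assumes the remaining columns are numeric
    y = df["Target"]
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

    # class_weight="balanced" compensates for the ~1% fraud rate
    clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_val, clf.predict(X_val)))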

    Tags

    • fraud-detection
    • classification
    • imbalanced-data
    • financial-transactions
    • machine-learning
    • python
    • beginner-friendly

    License: CC BY-NC-SA 4.0

  10. Software Defects 1k Dataset

    • cubig.ai
    zip
    Updated Jun 30, 2025
    Cite
    CUBIG (2025). Software Defects 1k Dataset [Dataset]. https://cubig.ai/store/products/536/software-defects-1k-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction

    • The Software Defects Dataset 1k contains 1,000 synthetic code functions written in seven programming languages, including Python, Java, JavaScript, C++, Go, and Rust, each labeled as buggy or clean (1/0) for software defect prediction.

    2) Data Utilization

    (1) Characteristics of the Software Defects Dataset 1k:

    • The dataset provides the actual function source code, the programming language, and metrics such as lines_of_code and cyclomatic_complexity, as well as static analysis-based features such as AST token count, if/return/function-call counts (num_ifs/num_returns/num_func_calls), and AST node count (ast_nodes).
    • For Python code, fine-grained abstract syntax tree (AST)-based analysis is included; other languages fall back to token-based analysis.

    (2) The Software Defects Dataset 1k can be used for:

    • Defect prediction modeling: development and evaluation of code defect prediction models using traditional ML models (Random Forest, XGBoost) or LLMs (CodeT5, GPT-4).
    • Cross-lingual analysis: studying cross-lingual defect patterns by comparing AST tokens, control statement patterns, etc., across multilingual codebases.
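
    A minimal defect-prediction sketch along those lines, assuming a CSV export whose columns follow the features described above (file and label names hypothetical):

    import pandas as pd
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Hypothetical file name; columns follow the features described above
    df = pd.read_csv("software_defects_1k.csv")
    features = ["lines_of_code", "cyclomatic_complexity", "token_count",
                "num_ifs", "num_returns", "num_func_calls", "ast_nodes"]

    X_tr, X_te, y_tr, y_te = train_test_split(
        df[features], df["defect"], stratify=df["defect"], random_state=42)

    clf = XGBClassifier(n_estimators=200, max_depth=4)
    clf.fit(X_tr, y_tr)
    print("F1:", f1_score(y_te, clf.predict(X_te)))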

  11. Output datasets from ML–assisted bibliometric workflow in African phytochemical metabolomics research

    • figshare.com
    zip
    Updated Oct 19, 2025
    Cite
    Temitope Omogbene; Fikisiwe Gebashe; Ibraheem Lawal; Stephen Amoo; Adeyemi O. Aremu (2025). Output datasets from ML–assisted bibliometric workflow in African phytochemical metabolomics research [Dataset]. http://doi.org/10.6084/m9.figshare.30396481.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 19, 2025
    Dataset provided by
    figshare
    Authors
    Temitope Omogbene; Fikisiwe Gebashe; Ibraheem Lawal; Stephen Amoo; Adeyemi O. Aremu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection contains supplementary datasets generated during the machine learning–assisted bibliometric workflow for metabolomics and phytochemical research. The datasets represent sequential outputs derived from the integration and harmonisation of bibliographic metadata from Scopus, Web of Science (WoS), and Dimensions, processed via R and Python environments. The datasets were produced through distinct workflow stages:

    • Dataset 1A (merged_dataset2.xlsx): Consolidated metadata produced in R from the merged raw bibliographic exports of Scopus, WoS, and Dimensions.
    • Dataset 1B (sampled_data.xlsx): A stratified random sample generated in Python for pretraining and manual annotation.
    • Dataset 1C (sample_data_pretrained.xlsx): Annotated sample dataset manually screened according to inclusion and exclusion criteria.
    • Dataset 1D (highlighted_full_data_with_predictions.xlsx): The complete harmonised dataset automatically classified using the trained XGBoost model.
    • Dataset 1E (absolute_metabolomics_data.xlsx): Final curated dataset of relevant records extracted from the ML-filtered corpus.

    Importantly, the file names of each dataset presented here were renamed from their original Google Drive file paths (referenced in the Python Google Colab scripts) to ensure sequential, descriptive, and logically ordered naming. This adjustment enhances clarity, reproducibility, and cross-reference consistency across all linked repositories.

  12. Preventive Maintenance for Marine Engines

    • kaggle.com
    zip
    Updated Feb 12, 2025
    Cite
    Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
    Explore at:
    Available download formats: zip (436025 bytes)
    Dataset updated
    Feb 12, 2025
    Authors
    Fijabi J. Adekunle
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Preventive Maintenance for Marine Engines: Data-Driven Insights

    Introduction:

    Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

    Overview

    This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

    Key steps include:

    1. Data Simulation: Creating a realistic dataset with engine performance metrics.
    2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
    3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
    4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
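
    A minimal sketch of step 4, with stand-in data in place of the simulated engine dataset (the parameter grid is illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Stand-in for the simulated engine data: 3 maintenance classes
    X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                               n_classes=3, random_state=42)

    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 20],
        "min_samples_split": [2, 5],
    }

    # 5-fold cross-validated grid search
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=5, scoring="accuracy", n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)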

    Tools Used

    1. Python: Data processing, analysis and modeling
    2. Pandas & NumPy: Data manipulation
    3. Scikit-Learn & XGBoost: Machine learning model training
    4. Matplotlib & Seaborn: Data visualization

    Skills Demonstrated

    ✔ Data Simulation & Preprocessing
    ✔ Exploratory Data Analysis (EDA)
    ✔ Feature Engineering & Encoding
    ✔ Supervised Machine Learning (Classification)
    ✔ Model Evaluation & Hyperparameter Tuning

    Key Insights & Findings

    📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
    📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
    📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training.
    📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

    Challenges Faced

    🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
    🚧 Model Performance: The accuracy was limited (~35%) due to the complexity of failure prediction.
    🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

    Call to Action

    🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
    📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
    🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.

  13. Data from: Simulated wildfire burned area over the CONUS during 2001-2020

    • osti.gov
    Updated Jul 30, 2024
    Cite
    Huang, Huilin; Liu, Ye (2024). Simulated wildfire burned area over the CONUS during 2001-2020 [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/2424127
    Explore at:
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Pacific Northwest National Laboratory
    DOE
    Authors
    Huang, Huilin; Liu, Ye
    Description

    Wildfires have shown increasing trends in both frequency and severity across the Contiguous United States (CONUS). However, process-based fire models have difficulties in accurately simulating the burned area over the CONUS due to a simplification of the physical process and cannot capture the interplay among fire, ignition, climate, and human activities. The deficiency of burned area simulation deteriorates the description of fire impact on energy balance, water budget, and carbon fluxes in the Earth System Models (ESMs). Alternatively, machine learning (ML) based fire models, which capture statistical relationships between the burned area and environmental factors, have shown promising burned area predictions and corresponding fire impact simulation. We develop a hybrid framework (ML4Fire-XGB) that integrates a pretrained eXtreme Gradient Boosting (XGBoost) wildfire model with the Energy Exascale Earth System Model (E3SM) land model (ELM). A Fortran-C-Python deep learning bridge is adapted to support online communication between ELM and the ML fire model. Specifically, the burned area predicted by the ML-based wildfire model is directly passed to ELM to adjust the carbon pool and vegetation dynamics after disturbance, which are then used as predictors in the ML-based fire model in the next time step. Evaluated against the historical burned area from the Global Fire Emissions Database 5 from 2001-2020, the ML4Fire-XGB model outperforms process-based fire models in terms of spatial distribution and seasonal variations. Sensitivity analysis confirms that the ML4Fire-XGB well captures the responses of the burned area to rising temperatures. The ML4Fire-XGB model has proved to be a new tool for studying vegetation-fire interactions, and more importantly, enables seamless exploration of climate-fire feedback, working as an active component in E3SM.

  14. faiss_whl

    • kaggle.com
    zip
    Updated Oct 20, 2025
    Cite
    Pierre Tisseur (2025). faiss_whl [Dataset]. https://www.kaggle.com/datasets/pierretisseur/faiss-whl
    Explore at:
    Available download formats: zip (63481896 bytes)
    Dataset updated
    Oct 20, 2025
    Authors
    Pierre Tisseur
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wheel files for offline use in notebooks:

    • Faiss (CPU)
    • XGBoost
    • Imbalance-XGBoost
    • Optuna

  15. Heart Disease Risk Prediction Dataset

    • kaggle.com
    zip
    Updated Feb 7, 2025
    Cite
    Mahatir Ahmed Tusher (2025). Heart Disease Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/mahatiratusher/heart-disease-risk-prediction-dataset
    Explore at:
    Available download formats: zip (1448235 bytes)
    Dataset updated
    Feb 7, 2025
    Authors
    Mahatir Ahmed Tusher
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Heart Disease Risk Prediction Dataset

    Overview

    This synthetic dataset is designed to predict the risk of heart disease based on a combination of symptoms, lifestyle factors, and medical history. Each row in the dataset represents a patient, with binary (Yes/No) indicators for symptoms and risk factors, along with a computed risk label indicating whether the patient is at high or low risk of developing heart disease.

    The dataset contains 70,000 samples, making it suitable for training machine learning models for classification tasks. The goal is to provide researchers, data scientists, and healthcare professionals with a clean and structured dataset to explore predictive modeling for cardiovascular health.

    This dataset is a side project of EarlyMed, developed by students of Vellore Institute of Technology (VIT-AP). EarlyMed aims to leverage data science and machine learning for early detection and prevention of chronic diseases.

    Dataset Features

    Input Features

    Symptoms (Binary - Yes/No)

    1. Chest Pain (chest_pain): Presence of chest pain, a common symptom of heart disease.
    2. Shortness of Breath (shortness_of_breath): Difficulty breathing, often associated with heart conditions.
    3. Unexplained Fatigue (fatigue): Persistent tiredness without an obvious cause.
    4. Palpitations (palpitations): Irregular or rapid heartbeat.
    5. Dizziness/Fainting (dizziness): Episodes of lightheadedness or fainting.
    6. Swelling in Legs/Ankles (swelling): Swelling due to fluid retention, often linked to heart failure.
    7. Pain in Arm/Jaw/Neck/Back (radiating_pain): Radiating pain, a hallmark of angina or heart attacks.
    8. Cold Sweats & Nausea (cold_sweats): Symptoms commonly associated with acute cardiac events.

    Risk Factors (Binary - Yes/No or Continuous)

    1. Age (age): Patient's age in years (continuous variable).
    2. High Blood Pressure (hypertension): History of hypertension (Yes/No).
    3. High Cholesterol (cholesterol_high): Elevated cholesterol levels (Yes/No).
    4. Diabetes (diabetes): Diagnosis of diabetes (Yes/No).
    5. Smoking History (smoker): Whether the patient is a smoker (Yes/No).
    6. Obesity (obesity): Obesity status (Yes/No).
    7. Family History of Heart Disease (family_history): Family history of cardiovascular conditions (Yes/No).

    Output Label

    • Heart Disease Risk (risk_label): Binary label indicating the risk of heart disease:
      • 0: Low risk
      • 1: High risk

    Data Generation Process

    This dataset was synthetically generated using Python libraries such as numpy and pandas. The generation process ensured a balanced distribution of high-risk and low-risk cases while maintaining realistic correlations between features. For example:

    • Patients with multiple risk factors (e.g., smoking, hypertension, and diabetes) were more likely to be labeled as high risk.
    • Symptom patterns were modeled after clinical guidelines and research studies on heart disease.
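
    A minimal numpy/pandas sketch in the spirit of that process (the rule and probabilities below are illustrative, not the dataset's actual generator):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 70_000

    df = pd.DataFrame({
        "age": rng.integers(18, 90, n),
        "hypertension": rng.integers(0, 2, n),
        "diabetes": rng.integers(0, 2, n),
        "smoker": rng.integers(0, 2, n),
        "chest_pain": rng.integers(0, 2, n),
    })

    # Illustrative rule: more risk factors raise the probability of the high-risk label
    score = df[["hypertension", "diabetes", "smoker", "chest_pain"]].sum(axis=1) + (df["age"] > 60)
    p_high = (0.1 + 0.15 * score).clip(0, 0.95)
    df["risk_label"] = (rng.random(n) < p_high).astype(int)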

    Sources of Inspiration

    The design of this dataset was inspired by the following resources:

    Books

    • "Harrison's Principles of Internal Medicine" by J. Larry Jameson et al.: A comprehensive resource on cardiovascular diseases and their symptoms.
    • "Mayo Clinic Cardiology" by Joseph G. Murphy et al.: Provides insights into heart disease risk factors and diagnostic criteria.

    Research Papers

    • Framingham Heart Study: A landmark study identifying key risk factors for cardiovascular disease.
    • American Heart Association (AHA) Guidelines: Recommendations for diagnosing and managing heart disease.

    Existing Datasets

    • UCI Heart Disease Dataset: A widely used dataset for heart disease prediction.
    • Kaggle’s Heart Disease datasets: Various datasets contributed by the community.

    Clinical Guidelines

    • Centers for Disease Control and Prevention (CDC): Information on heart disease symptoms and risk factors.
    • World Health Organization (WHO): Global statistics and risk factor analysis for cardiovascular diseases.

    Applications

    This dataset can be used for a variety of purposes:

    1. Machine Learning Research:

      • Train classification models (e.g., Logistic Regression, Random Forest, XGBoost) to predict heart disease risk.
      • Experiment with feature engineering, model tuning, and evaluation metrics like Accuracy, Precision, Recall, and ROC-AUC.
    2. Healthcare Analytics:

      • Identify key risk factors contributing to heart disease.
      • Develop decision support systems for early detection of cardiovascular risks.
    3. Educational Purposes:

      • Teach students and practitioners about predictive modeling in healthcare.
      • Demonstrate the importance of feature selection...
  16. Software Defects Dataset 1k

    • kaggle.com
    zip
    Updated Jun 16, 2025
    Cite
    Ravikumar R N (2025). Software Defects Dataset 1k [Dataset]. https://www.kaggle.com/datasets/ravikumarrn/software-defects-dataset-1k
    Explore at:
    Available download formats: zip (8453 bytes)
    Dataset updated
    Jun 16, 2025
    Authors
    Ravikumar R N
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    📦 Software Defects Multilingual Dataset with AST & Token Features

    This repository provides a dataset of 1,000 synthetic code functions across multiple programming languages for the purpose of software defect prediction, multilingual static analysis, and LLM evaluation.

    🙋 Citation

    If you use this dataset in your research or project, please cite it as:

    "Ravikumar R N, Software Defects Multilingual Dataset with AST Features (2025). Generated by synthetic methods for defect prediction and multilingual code analysis."

    🧠 Dataset Highlights

    • Languages Included: Python, Java, JavaScript, C, C++, Go, Rust
    • Records: 1,000 code snippets
    • Labels: defect (1 = buggy, 0 = clean)
    • Features:

      • token_count: Total tokens (AST-based for Python)
      • num_ifs, num_returns, num_func_calls: Code structure features
      • ast_nodes: Number of nodes in the abstract syntax tree (Python only)
      • lines_of_code & cyclomatic_complexity: Simulated metrics for modeling

    📊 Columns Description

    Column                 Description
    function_name          Unique identifier for the function
    code                   The actual function source code
    language               Programming language used
    lines_of_code          Approximate number of lines in the function
    cyclomatic_complexity  Simulated measure of decision complexity
    defect                 1 = buggy, 0 = clean
    token_count            Total token count (Python uses AST tokens)
    num_ifs                Count of 'if' statements
    num_returns            Count of 'return' statements
    num_func_calls         Number of function calls
    ast_nodes              AST node count (Python only, fallback = token count)
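
    A minimal sketch of how the Python-only AST features could be computed with the standard ast module (an illustration, not the dataset's actual generation code):

    import ast

    def ast_features(source: str) -> dict:
        """Count AST-derived features for one Python function."""
        tree = ast.parse(source)
        nodes = list(ast.walk(tree))
        return {
            "ast_nodes": len(nodes),
            "num_ifs": sum(isinstance(n, ast.If) for n in nodes),
            "num_returns": sum(isinstance(n, ast.Return) for n in nodes),
            "num_func_calls": sum(isinstance(n, ast.Call) for n in nodes),
        }

    print(ast_features("def f(x):\n    if x > 0:\n        return x\n    return -x"))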

    🛠️ Usage Examples

    This dataset is suitable for:

    • Training traditional ML models like Random Forests or XGBoost
    • Evaluating prompt-based or fine-tuned LLMs (e.g., CodeT5, GPT-4)
    • Feature importance studies using AST and static code metrics
    • Cross-lingual transfer learning in code understanding

    📎 License

    This dataset is synthetic and licensed under CC BY 4.0. Feel free to use, share, or adapt it with proper attribution.

  17. Spatiotemporal Soil Erosion Dataset for the Yarlung Tsangpo River Basin (1990–2100)

    • figshare.com
    zip
    Updated May 19, 2025
    Cite
    peng xin (2025). Spatiotemporal Soil Erosion Dataset for the Yarlung Tsangpo River Basin (1990–2100) [Dataset]. http://doi.org/10.6084/m9.figshare.29095763.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 19, 2025
    Dataset provided by
    figshare
    Authors
    peng xin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Yarlung Zangbo River
    Description

    This dataset includes historical erosion data (1990–2019), future soil erosion projections under SSP126, SSP245, and SSP585 scenarios (2021–2100), and predicted R and C factors for each period.

    Future R factors

    We incorporated 25 Global Climate Models (GCMs) from CMIP6 for calculating the future R factors, selected via the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) project (Table S3). The selection was based on the completeness of their time series and their alignment with the selected scenarios. Rainfall projections were corrected using quantile delta mapping (QDM) (Cannon et al., 2015) to address systematic biases in intensity distributions while preserving the projected trends in mean rainfall and extremes—critical for soil erosion analysis (Eekhout and de Vente, 2019). Bias correction was conducted using a 25-year baseline (1990–2014), with adjustments made monthly to correct for seasonal biases. The corrected bias functions were then applied to adjust the years (2020–2100) of daily rainfall data using the "ibicus" package, an open-source Python tool for bias adjustment and climate model evaluation. A minimum daily rainfall threshold of 0.1 mm was used to define rainy days, following established studies (Bulovic, 2024; Eekhout and de Vente, 2019; Switanek et al., 2017). Additionally, the study employed QDM to correct biases in historical GCM simulations, ensuring the applicability of the QDM method for rainfall bias correction in the YTRB. A baseline period of 1990–2010 was selected to establish the bias correction function, which was subsequently applied to adjust GCM simulations for 2011–2014. To evaluate the effectiveness of this calibration, we compared the annual mean precipitation from bias-corrected GCMs during 2011–2014 with observed precipitation data at the pixel level (Figs. S2, S3), using R² as the evaluation metric. The results showed a significant increase in R² after bias correction, confirming the effectiveness of the QDM approach.

    Future C factors

    To ensure the accuracy of the C factor predictions, we selected five CMIP6 climate models (Table S4) with high spatial resolution compared to other CMIP6 climate models. Of the five selected climate models, CanESM5, IPSL-CM6-LR, and MIROC-ES2L have high equilibrium climate sensitivity (ECS) values. The ECS is the expected long-term warming after a doubling of atmospheric CO2 concentrations, and it is one of the most important indicators for understanding the impact of future warming (Rao et al., 2023). Therefore, we selected these five climate models with ECS values >3.0 to capture the full range of potential climate-induced changes affecting soil erosion. After selecting the climate models, we constructed an XGBoost model using historical C factor data and bioclimatic variables from the WorldClim data portal. WorldClim provides global gridded datasets with a 1 km² spatial resolution, including 19 bioclimatic variables derived from monthly temperature and precipitation data, reflecting annual trends, seasonality, and extreme environmental conditions (Hijmans et al., 2005). However, strong collinearity among the 19 bioclimatic variables and an excessive number of input features may increase model complexity and reduce XGBoost's predictive accuracy. To optimize performance, we employed Recursive Feature Elimination (RFE), an iterative method for selecting the most relevant features while preserving prediction accuracy (Kornyo et al., 2023; Xiong et al., 2024). In each iteration, the current subset of features was used to train an XGBoost model, and feature importance was evaluated to remove the least significant variable, gradually refining the feature set. Using 80% of the data for training and 20% for testing, we employed 5-fold cross-validation to determine the feature subset that maximized the average R², ensuring optimal model performance. Additionally, a Genetic Algorithm (GA) was applied in each iteration to optimize the hyperparameters of the XGBoost model, which is crucial for enhancing both the efficiency and robustness of the model (Zhong and Liu, 2024; Zou et al., 2024). Finally, based on the variable selection results from RFE, the bioclimatic variables of future climate models were input into the trained XGBoost model to obtain the average C factor for the five selected climate models across four future periods (2020–2040, 2040–2060, 2060–2080, and 2080–2100).

    RUSLE model

    In this study, the mean annual soil loss was initially estimated using the RUSLE model, which enables us to estimate the spatial pattern of soil erosion (Renard et al., 1991). In areas where data are scarce, we consider RUSLE to be an effective runoff-dependent soil erosion model because it requires only limited data for the study area (Haile et al., 2012).
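
    A simplified sketch of the RFE-with-cross-validation selection described above, using scikit-learn's RFE for brevity rather than the authors' custom loop (stand-in data; the GA hyperparameter step is omitted):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBRegressor

    # Stand-in for the 19 bioclimatic predictors and the historical C factor target
    X, y = make_regression(n_samples=500, n_features=19, n_informative=8, random_state=0)

    best_score, best_n = -np.inf, None
    for n_features in range(4, 20, 3):
        rfe = RFE(XGBRegressor(n_estimators=200), n_features_to_select=n_features)
        Xs = rfe.fit_transform(X, y)
        # 5-fold CV on the selected subset; keep the size that maximizes mean R²
        score = cross_val_score(XGBRegressor(n_estimators=200), Xs, y,
                                cv=5, scoring="r2").mean()
        if score > best_score:
            best_score, best_n = score, n_features

    print(f"Best subset size: {best_n} (mean CV R^2 = {best_score:.3f})")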

  18. PythonLibraries|WheelFiles

    • kaggle.com
    zip
    Updated Mar 25, 2024
    Cite
    Ravi Ramakrishnan (2024). PythonLibraries|WheelFiles [Dataset]. https://www.kaggle.com/datasets/ravi20076/pythonlibrarieswheelfiles/code
    Explore at:
    Available download formats: zip (1556654809 bytes)
    Dataset updated
    Mar 25, 2024
    Authors
    Ravi Ramakrishnan
    License

    https://cdla.io/sharing-1-0/

    Description

    Hello all,
    This dataset is my humble attempt to allow myself and others to upgrade essential python packages to their latest versions. This dataset contains the .whl files of the below packages to be used across general kernels and especially in internet-off code challenges-

    Package                   Version(s)            Functionality
    AutoGluon                 1.0.0                 AutoML models
    Catboost                  1.2.2, 1.2.3          ML models
    Iterative-Stratification  0.1.7                 Iterative stratification for multi-label classifiers
    Joblib                    1.3.2                 File dumping and retrieval
    LAMA                      0.3.8b1               AutoML models
    LightGBM                  4.3.0, 4.2.0, 4.1.0   ML models
    MAPIE                     0.8.2                 Quantile regression
    Numpy                     1.26.3                Data wrangling
    Pandas                    2.1.4                 Data wrangling
    Polars                    0.20.3, 0.20.4        Data wrangling
    PyTorch                   2.0.1                 Neural networks
    PyTorch-TabNet            4.1.0                 Neural networks
    PyTorch-Forecast          0.7.0                 Neural networks
    Pygwalker                 0.3.20                Data wrangling and visualization
    Scikit-learn              1.3.2, 1.4.0          ML models / pipelines / data wrangling
    Scipy                     1.11.4                Data wrangling / statistics
    TabPFN                    10.1.9                ML models
    Torch-Frame               1.7.5                 Neural networks
    TorchVision               0.15.2                Neural networks
    XGBoost                   2.0.2, 2.0.1, 2.0.3   ML models


    I plan to update this dataset with more libraries and later versions as they get upgraded in due course. I hope these wheel files are useful to one and all.
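
    For internet-off kernels, wheels from an attached dataset are typically installed with pip's offline flags; a sketch (the input path is illustrative and depends on how the dataset is attached):

    !pip install --no-index --find-links=/kaggle/input/pythonlibrarieswheelfiles xgboost==2.0.3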

    Recent updates based on user feedback-

    1. lightgbm 4.1.0 and 4.3.0
    2. Older XGBoost versions (2.0.1 and 2.0.2)
    3. Torch-Frame, TabNet, PyTorch-Forecasting, TorchVision
    4. MAPIE
    5. LAMA 0.3.8b1
    6. Iterative-Stratification
    7. Catboost 1.2.3

    Best regards and happy learning and coding!

  19. Kaggle Blog: Winners' Posts

    • kaggle.com
    zip
    Updated Sep 21, 2016
    Cite
    Kaggle (2016). Kaggle Blog: Winners' Posts [Dataset]. https://www.kaggle.com/kaggle/kaggle-blog-winners-posts
    Explore at:
    Available download formats: zip (530977 bytes)
    Dataset updated
    Sep 21, 2016
    Dataset authored and provided by
    Kaggle: http://kaggle.com/
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    In 2010, Kaggle launched its first competition, which was won by Jure Zbontar, who used a simple linear model. Since then a lot has changed. We've seen the rebirth of neural networks, the rise of Python, the creation of powerful libraries like XGBoost, Keras and Tensorflow.

    This dataset is a dump of all winners' posts from the Kaggle blog, starting with Jure Zbontar's. It allows us to track trends in the techniques, tools, and libraries that win competitions.

    This is a simple dump. If there's demand, I can upload more detail (including comments and tags).

  20. Mean Annual Habitat Quality and Its Driving Variables in China (1990–2018)

    • figshare.com
    csv
    Updated May 18, 2025
    Cite
    ChenXi Zhu; Pedro Cabral (2025). Mean Annual Habitat Quality and Its Driving Variables in China (1990–2018) [Dataset]. http://doi.org/10.6084/m9.figshare.29086178.v2
    Explore at:
    Available download formats: csv
    Dataset updated
    May 18, 2025
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    ChenXi Zhu; Pedro Cabral
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    This dataset uses the variables listed in Table 1 to train four machine learning models—Linear Regression, Decision Tree, Random Forest, and Extreme Gradient Boosting—to explain the mean annual habitat quality in China from 1990 to 2018. The best-performing model (XGBoost) achieved an R² of 0.8411, a mean absolute error (MAE) of 0.0862, and a root mean square error (RMSE) of 0.1341. All raster data were resampled to a 0.1º spatial resolution using bilinear interpolation and projected to the WGS 1984 World Mercator coordinate system.

    The dataset includes the following files:

    • A CSV file containing the mean annual values of the dependent variable (habitat quality) and the independent variables across China from 1990 to 2018, based on the data listed in Table 1. (HQ: Habitat Quality; CZ: Climate Zone; FFI: Forest Fragmentation Index; GPP: Gross Primary Productivity; Light: Nighttime Lights; PRE: Mean Annual Precipitation Sum; ASP: Aspect; RAD: Solar Radiation; SLOPE: Slope; TEMP: Mean Annual Temperature; SM: Soil Moisture)
    • A Python script used for modeling habitat quality, including mean encoding of the categorical variable climate zone (CZ), multicollinearity testing using Variance Inflation Factor (VIF), and implementation of four machine learning models to predict habitat quality.

    Table 1. Variables used in the machine learning models

    Dataset                        Units            Source
    Habitat Quality                -                Calculated based on landcover map (Yang and Huang, 2021)
    Gross Primary Productivity     gC m-2 d-1       (Wang et al., 2021)
    Temperature                    ºC               (Peng et al., 2019)
    Precipitation                  0.1 mm           (Peng et al., 2019)
    Downward shortwave radiation   W m−2            (He et al., 2020)
    Soil moisture                  m3 m−3           (K. Zhang et al., 2024)
    Nighttime light                Digital Number   (L. Zhang et al., 2024)
    Forest fragmentation index     -                Derived from landcover map (Yang & Huang, 2021)
    Digital Elevation Model        m                (CGIAR-CSI, 2022)
    Aspect                         Degree           Derived from DEM (CGIAR-CSI, 2022)
    Slope                          Degree           Derived from DEM (CGIAR-CSI, 2022)
    Climate zones                  -                (Kottek et al., 2006)

    References

    CGIAR-CSI. (2022). SRTM DEM dataset in China (2000). In National Tibetan Plateau Data Center. National Tibetan Plateau Data Center. https://dx.doi.org/
    He, J., Yang, K., Tang, W., Lu, H., Qin, J., Chen, Y., & Li, X. (2020). The first high-resolution meteorological forcing dataset for land process studies over China. Scientific Data, 7(1), 25. https://doi.org/10.1038/s41597-020-0369-y
    Kottek, M., Grieser, J., Beck, C., Rudolf, B., & Rubel, F. (2006). World Map of the Köppen-Geiger climate classification updated. Meteorologische Zeitschrift, 15(3), 259–263. https://doi.org/10.1127/0941-2948/2006/0130
    Peng, S., Ding, Y., Liu, W., & Li, Z. (2019). 1 km monthly temperature and precipitation dataset for China from 1901 to 2017. Earth System Science Data, 11(4), 1931–1946. https://doi.org/10.5194/essd-11-1931-2019
    Wang, S., Zhang, Y., Ju, W., Qiu, B., & Zhang, Z. (2021). Tracking the seasonal and inter-annual variations of global gross primary production during last four decades using satellite near-infrared reflectance data. Science of The Total Environment, 755, 142569. https://doi.org/10.1016/j.scitotenv.2020.142569
    Yang, J., & Huang, X. (2021). The 30 m annual land cover dataset and its dynamics in China from 1990 to 2019. Earth System Science Data, 13(8), 3907–3925. https://doi.org/10.5194/essd-13-3907-2021
    Zhang, K., Chen, H., Ma, N., Shang, S., Wang, Y., Xu, Q., & Zhu, G. (2024). A global dataset of terrestrial evapotranspiration and soil moisture dynamics from 1982 to 2020. Scientific Data, 11(1), 445. https://doi.org/10.1038/s41597-024-03271-7
    Zhang, L., Ren, Z., Chen, B., Gong, P., Xu, B., & Fu, H. (2024). A Prolonged Artificial Nighttime-light Dataset of China (1984-2020). Scientific Data, 11(1), 414. https://doi.org/10.1038/s41597-024-03223-1
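
    A minimal sketch of the preprocessing steps named above (mean encoding of CZ and VIF screening), assuming the abbreviated column names from the CSV description and a hypothetical file name; the threshold is a common rule of thumb, not the authors' setting:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.read_csv("habitat_quality_china.csv")  # hypothetical file name

    # Mean-encode the categorical climate zone by average habitat quality per zone
    df["CZ_encoded"] = df.groupby("CZ")["HQ"].transform("mean")

    # VIF screening of the predictors (VIF > 10 is a common multicollinearity flag)
    predictors = ["FFI", "GPP", "Light", "PRE", "ASP", "RAD", "SLOPE", "TEMP", "SM", "CZ_encoded"]
    X = df[predictors].dropna().assign(const=1.0)  # constant for the auxiliary regressions
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(len(predictors))],
        index=predictors,
    )
    print(vif[vif > 10])  # candidates to drop or combine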
