Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fit statistics for scored XGBoost models with 50,000 rows per dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A Complete End-to-End Solution for Analyzing and Predicting Used Car Prices in the UAE
This dataset provides a compressed ZIP archive of the "UAE Used Cars Analysis" project, featuring 10,000 used car listings with precise location data (covering cities like Dubai, Abu Dhabi, and Sharjah), source code for a Dash-based web application, and a trained XGBoost model. It is designed for data scientists, analysts, and automotive enthusiasts to explore regional market trends, predict car prices, and visualize geospatial insights.
Repository contents:
- data/uae_used_cars_10k.csv: Dataset with 10,000 records, including a Location column (e.g., Dubai, Abu Dhabi, Sharjah).
- models/:
  - stacking_model.pkl: Trained XGBoost model.
  - scaler.pkl: Preprocessing scaler.
- models.py: Model-related functions.
- app.py: Main Dash application file.
- callbacks.py: Interactive callbacks for the dashboard.
- layouts.py: UI layout definitions.
- train_model.py: Model training script.
- utils.py: Utility functions.
- requirements.txt: Required Python libraries.
- README.md: Project documentation.

Setup: install dependencies with pip install -r requirements.txt, then run python app.py and access the app at http://127.0.0.1:8050/.

The dataset (uae_used_cars_10k.csv) includes:
- Make: Car brand (e.g., Toyota).
- Model: Car model (e.g., Camry).
- Year: Manufacturing year.
- Mileage: Distance driven in miles.
- Cylinders: Number of engine cylinders.
- Price: Sale price in AED.
- Transmission: Automatic or Manual.
- Fuel Type: Petrol, Diesel, etc.
- Color: Exterior color.
- Description: Seller's description.
- Location: City of sale (e.g., Dubai, Abu Dhabi, Sharjah).
Notes:
- Some columns (e.g., Mileage, Cylinders) may contain missing values; imputation is recommended.
- Mileage is in miles; convert to kilometers if needed (Mileage_Km = Mileage * 1.60934). A short sketch follows this list.
- Dependencies are listed in requirements.txt (e.g., Dash, XGBoost, dash-leaflet for maps).
- Data were aggregated from UAE car platforms in March 2025.
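A minimal sketch of those two preprocessing steps (column names follow the description above; the file path is an assumption from the repository layout):

```python
import pandas as pd

# Load the listings (path assumed from the repository layout above).
df = pd.read_csv("data/uae_used_cars_10k.csv")

# Impute missing numeric values with the column median.
for col in ["Mileage", "Cylinders"]:
    df[col] = df[col].fillna(df[col].median())

# Convert mileage from miles to kilometers.
df["Mileage_Km"] = df["Mileage"] * 1.60934
```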
Last Updated: March 9, 2025 | Version 1.0 | Author: Mohammed Saad
This is the whl file for XGBoost version 2.0.0 (released 12th September 2023)
Update: This is the whl file for XGBoost version 2.0.2 (released 13th November 2023)
Update: This is the whl file for XGBoost version 2.0.3 (released 19th December 2023)
Installation: attach this dataset to one's notebook, then:

```python
!pip install -q /kaggle/input/xgboost-2-0-0-whl/xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl

import xgboost as xgb
xgb.__version__
```
License: Apache Software License (Apache-2.0)
AUROC was the performance measure for hyperparameter tuning and best-model selection during training. Hyperparameters not mentioned in the table were left at the XGBClassifier defaults (Python 3.7).
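A minimal sketch of AUROC-scored tuning in that style (the grid values are illustrative, not those from the table):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "max_depth": [3, 5, 7],       # illustrative values
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 300],
}

# Score candidate hyperparameters by AUROC; parameters not in
# the grid keep the XGBClassifier defaults.
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
# search.fit(X_train, y_train)  # X_train, y_train: your training data
```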
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Accurately predicting glass density is crucial for designing novel materials. This study aims to develop a robust predictive model for the density of oxide glasses and, more importantly, to investigate how physically-informed feature engineering can create accurate and interpretable models that reveal underlying physical principles. Using a dataset of 76,593 oxide glasses from the SciGlass database, three ML models (ElasticNet, XGBoost, MLP) were trained and evaluated. Four distinct feature sets were constructed with increasing physical complexity, ranging from simple elemental composition to the advanced Magpie descriptors. The best model was further analyzed for interpretability using feature importance and SHAP analysis. A clear hierarchical improvement in predictive accuracy was observed with increasing feature sophistication across all models. The XGBoost model combined with the Magpie feature set provided the best performance, achieving a coefficient of determination (R²) of 0.97. Interpretability analysis revealed that the model's predictions were overwhelmingly driven by physical attributes, with mean atomic weight being the most influential predictor. The model learns to approximate the fundamental density equation using mean atomic weight as a proxy for molar mass and electronic structure features to estimate molar volume. This demonstrates that a data-driven approach can function as a scientifically valid and interpretable tool, accelerating the discovery of new materials.
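As an illustration of the interpretability step described above, a SHAP analysis of an XGBoost regressor can be sketched as follows (placeholder data, not the SciGlass features):

```python
import shap
import xgboost as xgb
from sklearn.datasets import make_regression

# Placeholder data standing in for the glass composition features.
X, y = make_regression(n_samples=500, n_features=10, random_state=0)

model = xgb.XGBRegressor(n_estimators=200).fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)  # global feature-importance view
```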
This database includes saturated hydraulic conductivity data from the USKSAT database, as well as the associated Python codes used to analyze learning curves and to train and test the developed machine learning models.
This data package presents forcing data, model code, and model output for classical machine learning models that predict monthly stream water temperature, as presented in the manuscript 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water (Weierbach et al., 2022). For input forcing datasets, we include two files, each generated using the BASIN-3D data integration tool (Varadharajan et al., 2022), for stations in the Pacific Northwest and Mid-Atlantic hydrologic regions. Model code (written in Python using Jupyter notebooks) covers data preprocessing; training Multiple Linear Regression, Support Vector Regression, and Extreme Gradient Boosted Tree models; and additional notebooks for analysis of model output. We also include model output files, in HDF5 format, representing the modeling configurations presented in the manuscript. Together, these data make up the workflow for predictions across three scenarios (single station, regional, and predictions in unmonitored basins) presented in the manuscript and allow for reproducibility of the modeling procedures.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data folders and files are organized in the repository as follows:
Pin-on-disk_data folder contains three subfolders (Raw_files; Processed_files; COF_calculation). The Raw_files subfolder contains 12 DWF files with the raw data for each specimen. The Processed_files subfolder contains 12 XLSX files with the processed data for each specimen. The COF_calculation subfolder contains eight XLSX files with the average COF and time data for each material, two PNG files with the plots of the average COF vs. time for each material pair, and one Python script for calculating and visualizing the average COF and time data.

Prediction folder contains three subfolders (Input_files; Output_data; Python_COF_prediction). The Input_files subfolder contains four XLSX files with the input data for the Python script to make the predictions of COF vs. sliding time for each material. The Output_data subfolder contains eight XLSX files with the actual and predicted values of COF for two different sets (test and validation) of each material, four TXT files with the performance metrics of the predicted COF for each material, and four PNG files with the plots of the actual vs. predicted COF as a function of time for each material. The Python_COF_prediction subfolder contains one Python script for making and evaluating the predictions of COF vs. sliding time using an XGBoost model.
The data were collected by performing dry wear tests at room temperature with linear velocity 0.5 m∙s−1, load 50 N and sliding time of 420 s. The labels of the specimen indicate the following:
- AC_3_2, AC_3_3, AC_3_4: three datasets from pin-on-disk tests conducted with the open-cell AlSi10Mg-Al2O3 composite with pore size of 800–1000 μm (AC);
- C_5_1, C_5_2, C_5_3: three datasets from pin-on-disk tests conducted with the open-cell AlSi10Mg material with pore size of 800–1000 μm (C);
- AE_3_2, AE_4_1, AE_6_6: three datasets from pin-on-disk tests conducted with the open-cell AlSi10Mg-Al2O3 composite with pore size of 1000–1200 μm (AE);
- E_3_1, E_6, E_6_3: three datasets from pin-on-disk tests conducted with the open-cell AlSi10Mg material with pore size of 1000–1200 μm (E).
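The Python_COF_prediction script is not reproduced here, but a minimal sketch of this kind of COF-vs-sliding-time prediction with XGBoost might look as follows (column and file names are assumptions, not the repository's actual layout):

```python
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Hypothetical input layout: sliding time plus a COF target.
df = pd.read_excel("Input_files/AC_input.xlsx")  # assumed file name
X, y = df[["time"]], df["COF"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBRegressor(n_estimators=300).fit(X_tr, y_tr)

pred = model.predict(X_te)
print("R2:", r2_score(y_te, pred))
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)
```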
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Overview
This dataset is designed to help build, train, and evaluate machine learning models that detect fraudulent transactions. We have included additional CSV files containing location-based scores, proprietary weights for grouping, network turn-around times, and vulnerability scores.
Key Points
- Severe Class Imbalance: Only a tiny fraction (less than 1%) of transactions are fraudulent.
- Multiple Feature Files: Combine them by matching on id or Group.
- Target: The Target column in train.csv indicates fraud (1) vs. clean (0).
- Goal: Predict which transactions in test_share.csv might be fraudulent.
Files:
- train.csv: Training data with the Target column (0 = Clean, 1 = Fraud).
- test_share.csv: Same structure as train.csv but without the Target column.
- Geo_scores.csv: Location-based scores.
- Lambda_wts.csv: Proprietary group weights, keyed by Group.
- Qset_tats.csv: Network turn-around times.
- instance_scores.csv: Vulnerability scores.

Join the extra feature files (Geo_scores.csv, Lambda_wts.csv, etc.) with train.csv by matching id or Group. Train on train.csv (roughly 1% of Target is fraud) and validate on test_share.csv or your own external data. Possible Tools:
- Python: pandas, NumPy, scikit-learn
- Imbalance Handling: SMOTE, Random Oversampler, or class weights
- Metrics: Precision, Recall, F1-score, ROC-AUC, etc.
Beginner Tip: Check how these extra CSVs (Geo, lambda, instance scores, TAT) might improve fraud detection performance!
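A minimal sketch of that merge plus simple class weighting (join keys and file layout assumed from the descriptions above):

```python
import pandas as pd
from xgboost import XGBClassifier

train = pd.read_csv("train.csv")
train = train.merge(pd.read_csv("Geo_scores.csv"), on="id", how="left")
train = train.merge(pd.read_csv("Lambda_wts.csv"), on="Group", how="left")

X = train.drop(columns=["Target", "id", "Group"], errors="ignore")
y = train["Target"]

# With <1% positives, weight the minority class instead of
# (or in addition to) oversampling with SMOTE.
ratio = (y == 0).sum() / (y == 1).sum()
model = XGBClassifier(scale_pos_weight=ratio, eval_metric="aucpr")
# model.fit(X, y)
```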
Tags: fraud-detection, classification, imbalanced-data, financial-transactions, machine-learning, python, beginner-friendly
License: CC BY-NC-SA 4.0
https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Software Defects Dataset 1k contains 1,000 synthetic code functions written in seven programming languages, including Python, Java, JavaScript, C++, Go, and Rust, each labeled as buggy or clean (1/0) for software defect prediction.
2) Data Utilization
(1) Software Defects Dataset 1k has characteristics that:
• This dataset provides metrics such as the actual function source code, programming language, lines_of_code, and cyclomatic_complexity, as well as static analysis-based features such as AST token count, if/return/function-call counts (num_ifs/num_returns/num_func_calls), and AST node count (ast_nodes).
• For Python code, fine-grained abstract syntax tree (AST)-based analysis is included; other languages fall back to token-based analysis.
(2) Software Defects Dataset 1k can be used to:
• Defect prediction modeling: development and evaluation of code defect prediction models using traditional ML models (Random Forest, XGBoost) or LLMs (CodeT5, GPT-4).
• Cross-lingual analysis: studying cross-lingual defect patterns by comparing AST tokens, control-statement patterns, etc. across multilingual codebases.
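For the Python functions, structure features of this kind can be recomputed with the standard-library ast module; a minimal sketch (not the dataset's own generator):

```python
import ast

def ast_features(source: str) -> dict:
    """Count AST-derived structure features for one Python function."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    return {
        "ast_nodes": len(nodes),
        "num_ifs": sum(isinstance(n, ast.If) for n in nodes),
        "num_returns": sum(isinstance(n, ast.Return) for n in nodes),
        "num_func_calls": sum(isinstance(n, ast.Call) for n in nodes),
    }

print(ast_features("def f(x):\n    if x > 0:\n        return x\n    return -x"))
```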
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains supplementary datasets generated during the machine learning–assisted bibliometric workflow for metabolomics and phytochemical research. The datasets represent sequential outputs derived from the integration and harmonisation of bibliographic metadata from Scopus, Web of Science (WoS), and Dimensions, processed via R and Python environments.

The datasets were produced through distinct workflow stages:
- Dataset 1A (merged_dataset2.xlsx): Consolidated metadata produced in R from the merged raw bibliographic exports of Scopus, WoS, and Dimensions.
- Dataset 1B (sampled_data.xlsx): A stratified random sample generated in Python for pretraining and manual annotation.
- Dataset 1C (sample_data_pretrained.xlsx): Annotated sample dataset manually screened according to inclusion and exclusion criteria.
- Dataset 1D (highlighted_full_data_with_predictions.xlsx): The complete harmonised dataset automatically classified using the trained XGBoost model.
- Dataset 1E (absolute_metabolomics_data.xlsx): Final curated dataset of relevant records extracted from the ML-filtered corpus.

Importantly, the file names of each dataset presented here were renamed from their original Google Drive file paths (referenced in the Python Google Colab scripts) to ensure sequential, descriptive, and logically ordered naming. This adjustment enhances clarity, reproducibility, and cross-reference consistency across all linked repositories.
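A stratified sample like Dataset 1B could be drawn along these lines (a sketch; the stratification column name "source" is an assumption):

```python
import pandas as pd

df = pd.read_excel("merged_dataset2.xlsx")

# Sample 10% within each source database so the annotation set
# mirrors the Scopus/WoS/Dimensions composition ("source" column assumed).
sampled = (
    df.groupby("source", group_keys=False)
      .apply(lambda g: g.sample(frac=0.1, random_state=42))
)
sampled.to_excel("sampled_data.xlsx", index=False)
```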
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Preventive Maintenance for Marine Engines: Data-Driven Insights
Introduction:
Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.
Overview This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.
Key steps include:
1. Data Simulation: Creating a realistic dataset with engine performance metrics.
2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs (see the sketch below).
4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
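A minimal sketch of the step 3 comparison, on placeholder data rather than the simulated engine dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Placeholder stand-in for the engine features and the three-class
# maintenance label (Normal / Requires Maintenance / Critical).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "XGBoost": XGBClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```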
Tools Used
1. Python: Data processing, analysis and modeling
2. Pandas & NumPy: Data manipulation
3. Scikit-Learn & XGBoost: Machine learning model training
4. Matplotlib & Seaborn: Data visualization

Skills Demonstrated
✔ Data Simulation & Preprocessing
✔ Exploratory Data Analysis (EDA)
✔ Feature Engineering & Encoding
✔ Supervised Machine Learning (Classification)
✔ Model Evaluation & Hyperparameter Tuning

Key Insights & Findings
📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training.
📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

Challenges Faced
🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
🚧 Model Performance: Accuracy was limited (~35%) due to the complexity of failure prediction.
🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

Call to Action
🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.
Wildfires have shown increasing trends in both frequency and severity across the Contiguous United States (CONUS). However, process-based fire models have difficulties in accurately simulating the burned area over the CONUS due to a simplification of the physical process and cannot capture the interplay among fire, ignition, climate, and human activities. The deficiency of burned area simulation deteriorates the description of fire impact on energy balance, water budget, and carbon fluxes in the Earth System Models (ESMs). Alternatively, machine learning (ML) based fire models, which capture statistical relationships between the burned area and environmental factors, have shown promising burned area predictions and corresponding fire impact simulation. We develop a hybrid framework (ML4Fire-XGB) that integrates a pretrained eXtreme Gradient Boosting (XGBoost) wildfire model with the Energy Exascale Earth System Model (E3SM) land model (ELM). A Fortran-C-Python deep learning bridge is adapted to support online communication between ELM and the ML fire model. Specifically, the burned area predicted by the ML-based wildfire model is directly passed to ELM to adjust the carbon pool and vegetation dynamics after disturbance, which are then used as predictors in the ML-based fire model in the next time step. Evaluated against the historical burned area from the Global Fire Emissions Database 5 from 2001-2020, the ML4Fire-XGB model outperforms process-based fire models in terms of spatial distribution and seasonal variations. Sensitivity analysis confirms that ML4Fire-XGB well captures the responses of the burned area to rising temperatures. The ML4Fire-XGB model has proved to be a new tool for studying vegetation-fire interactions and, more importantly, enables seamless exploration of climate-fire feedback, working as an active component in E3SM.
https://creativecommons.org/publicdomain/zero/1.0/

To use offline in the notebook:
- Faiss (CPU)
- XGBoost
- Imbalance-XGBoost
- Optuna
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This synthetic dataset is designed to predict the risk of heart disease based on a combination of symptoms, lifestyle factors, and medical history. Each row in the dataset represents a patient, with binary (Yes/No) indicators for symptoms and risk factors, along with a computed risk label indicating whether the patient is at high or low risk of developing heart disease.
The dataset contains 70,000 samples, making it suitable for training machine learning models for classification tasks. The goal is to provide researchers, data scientists, and healthcare professionals with a clean and structured dataset to explore predictive modeling for cardiovascular health.
This dataset is a side project of EarlyMed, developed by students of Vellore Institute of Technology (VIT-AP). EarlyMed aims to leverage data science and machine learning for early detection and prevention of chronic diseases.
Features:
- chest_pain: Presence of chest pain, a common symptom of heart disease.
- shortness_of_breath: Difficulty breathing, often associated with heart conditions.
- fatigue: Persistent tiredness without an obvious cause.
- palpitations: Irregular or rapid heartbeat.
- dizziness: Episodes of lightheadedness or fainting.
- swelling: Swelling due to fluid retention, often linked to heart failure.
- radiating_pain: Radiating pain, a hallmark of angina or heart attacks.
- cold_sweats: Symptoms commonly associated with acute cardiac events.
- age: Patient's age in years (continuous variable).
- hypertension: History of hypertension (Yes/No).
- cholesterol_high: Elevated cholesterol levels (Yes/No).
- diabetes: Diagnosis of diabetes (Yes/No).
- smoker: Whether the patient is a smoker (Yes/No).
- obesity: Obesity status (Yes/No).
- family_history: Family history of cardiovascular conditions (Yes/No).
- risk_label: Binary label indicating the risk of heart disease (0 = low risk, 1 = high risk).

This dataset was synthetically generated using Python libraries such as numpy and pandas. The generation process ensured a balanced distribution of high-risk and low-risk cases while maintaining realistic correlations between features. For example:
- Patients with multiple risk factors (e.g., smoking, hypertension, and diabetes) were more likely to be labeled as high risk.
- Symptom patterns were modeled after clinical guidelines and research studies on heart disease.
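A toy version of that generation logic, with illustrative thresholds and probabilities rather than the actual generator's:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 70_000

df = pd.DataFrame({
    "age": rng.integers(25, 85, n),
    "smoker": rng.integers(0, 2, n),
    "hypertension": rng.integers(0, 2, n),
    "diabetes": rng.integers(0, 2, n),
    "chest_pain": rng.integers(0, 2, n),
})

# More accumulated risk factors -> higher chance of a high-risk label.
score = df[["smoker", "hypertension", "diabetes", "chest_pain"]].sum(axis=1)
prob = 0.1 + 0.2 * score  # illustrative mapping to a probability
df["risk_label"] = (rng.random(n) < prob).astype(int)
```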
The design of this dataset was inspired by clinical guidelines and published research studies on heart disease.
This dataset can be used for a variety of purposes, including machine learning research, healthcare analytics, and educational purposes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📦 Software Defects Multilingual Dataset with AST & Token Features
This repository provides a dataset of 1,000 synthetic code functions across multiple programming languages for the purpose of software defect prediction, multilingual static analysis, and LLM evaluation.
🙋 Citation
If you use this dataset in your research or project, please cite it as:
"Ravikumar R N, Software Defects Multilingual Dataset with AST Features (2025). Generated by synthetic methods for defect prediction and multilingual code analysis."
🧠 Dataset Highlights
Target: defect (1 = buggy, 0 = clean)

Features:
- token_count: Total tokens (AST-based for Python)
- num_ifs, num_returns, num_func_calls: Code structure features
- ast_nodes: Number of nodes in the abstract syntax tree (Python only)
- lines_of_code & cyclomatic_complexity: Simulated metrics for modeling

📊 Columns Description
| Column | Description |
|---|---|
| function_name | Unique identifier for the function |
| code | The actual function source code |
| language | Programming language used |
| lines_of_code | Approximate number of lines in the function |
| cyclomatic_complexity | Simulated measure of decision complexity |
| defect | 1 = buggy, 0 = clean |
| token_count | Total token count (Python uses AST tokens) |
| num_ifs | Count of 'if' statements |
| num_returns | Count of 'return' statements |
| num_func_calls | Number of function calls |
| ast_nodes | AST node count (Python only, fallback = token count) |
🛠️ Usage Examples
This dataset is suitable for:
- Software defect prediction experiments
- Multilingual static analysis
- LLM evaluation on code tasks
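As one example, a quick baseline on the simulated numeric features might look like this (the CSV file name is an assumption):

```python
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("software_defects_1k.csv")  # assumed file name

features = ["lines_of_code", "cyclomatic_complexity", "token_count",
            "num_ifs", "num_returns", "num_func_calls", "ast_nodes"]
X_tr, X_te, y_tr, y_te = train_test_split(
    df[features], df["defect"], test_size=0.2, random_state=0)

model = XGBClassifier().fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```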
📎 **License**
This dataset is synthetic and licensed under CC BY 4.0. Feel free to use, share, or adapt it with proper attribution.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes historical erosion data (1990–2019), future soil erosion projections under SSP126, SSP245, and SSP585 scenarios (2021–2100), and predicted R and C factors for each period.

Future R factors
We incorporated 25 Global Climate Models (GCMs) from CMIP6 for calculating the future R factors, selected via the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) project (Table S3). The selection was based on the completeness of their time series and their alignment with the selected scenarios. Rainfall projections were corrected using quantile delta mapping (QDM) (Cannon et al., 2015) to address systematic biases in intensity distributions while preserving the projected trends in mean rainfall and extremes, which is critical for soil erosion analysis (Eekhout and de Vente, 2019). Bias correction was conducted using a 25-year baseline (1990–2014), with adjustments made monthly to correct for seasonal biases. The corrected bias functions were then applied to adjust the daily rainfall data for 2020–2100 using the "ibicus" package, an open-source Python tool for bias adjustment and climate model evaluation. A minimum daily rainfall threshold of 0.1 mm was used to define rainy days, following established studies (Bulovic, 2024; Eekhout and de Vente, 2019; Switanek et al., 2017). Additionally, the study employed QDM to correct biases in historical GCM simulations, ensuring the applicability of the QDM method for rainfall bias correction in the YTRB. A baseline period of 1990–2010 was selected to establish the bias correction function, which was subsequently applied to adjust GCM simulations for 2011–2014. To evaluate the effectiveness of this calibration, we compared the annual mean precipitation from bias-corrected GCMs during 2011–2014 with observed precipitation data at the pixel level (Figs. S2, S3), using R² as the evaluation metric. The results showed a significant increase in R² after bias correction, confirming the effectiveness of the QDM approach.

Future C factors
To ensure the accuracy of the C factor predictions, we selected five CMIP6 climate models (Table S4) with high spatial resolution compared to other CMIP6 climate models. Of the five selected climate models, CanESM5, IPSL-CM6-LR, and MIROC-ES2L have high equilibrium climate sensitivity (ECS) values. The ECS is the expected long-term warming after a doubling of atmospheric CO2 concentrations, and it is one of the most important indicators for understanding the impact of future warming (Rao et al., 2023). Therefore, we selected these five climate models with ECS values >3.0 to capture the full range of potential climate-induced changes affecting soil erosion. After selecting the climate models, we constructed an XGBoost model using historical C factor data and bioclimatic variables from the WorldClim data portal. WorldClim provides global gridded datasets with a 1 km² spatial resolution, including 19 bioclimatic variables derived from monthly temperature and precipitation data, reflecting annual trends, seasonality, and extreme environmental conditions (Hijmans et al., 2005). However, strong collinearity among the 19 bioclimatic variables and an excessive number of input features may increase model complexity and reduce XGBoost's predictive accuracy. To optimize performance, we employed Recursive Feature Elimination (RFE), an iterative method for selecting the most relevant features while preserving prediction accuracy (Kornyo et al., 2023; Xiong et al., 2024).
In each iteration, the current subset of features was used to train an XGBoost model, and feature importance was evaluated to remove the least significant variable, gradually refining the feature set. Using 80% of the data for training and 20% for testing, we employed 5-fold cross-validation to determine the feature subset that maximized the average R², ensuring optimal model performance. Additionally, a Genetic Algorithm (GA) was applied in each iteration to optimize the hyperparameters of the XGBoost model, which is crucial for enhancing both the efficiency and robustness of the model (Zhong and Liu, 2024; Zou et al., 2024). Finally, based on the variable selection results from RFE, the bioclimatic variables of the future climate models were input into the trained XGBoost model to obtain the average C factor for the five selected climate models across four future periods (2020–2040, 2040–2060, 2060–2080, and 2080–2100).

RUSLE model
In this study, the mean annual soil loss was initially estimated using the RUSLE model, which enables us to estimate the spatial pattern of soil erosion (Renard et al., 1991). In areas where data are scarce, we consider RUSLE to be an effective runoff-dependent soil erosion model because it requires only limited data for the study area (Haile et al., 2012).
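A simplified sketch of the RFE loop with 5-fold cross-validated R², as described above (the GA hyperparameter search is omitted; the data are placeholders for the 19 bioclimatic variables):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Placeholder for the 19 bioclimatic predictors and the historical C factor.
X, y = make_regression(n_samples=1000, n_features=19, random_state=0)

best_r2, best_n = -1.0, None
for n_features in range(19, 4, -1):
    # Drop the least important features down to n_features.
    selector = RFE(XGBRegressor(n_estimators=100),
                   n_features_to_select=n_features)
    X_sel = selector.fit_transform(X, y)
    r2 = cross_val_score(XGBRegressor(n_estimators=100), X_sel, y,
                         cv=5, scoring="r2").mean()
    if r2 > best_r2:
        best_r2, best_n = r2, n_features
print(best_n, best_r2)
```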
https://cdla.io/sharing-1-0/
Hello all,
This dataset is my humble attempt to allow myself and others to upgrade essential Python packages to their latest versions. It contains the .whl files of the below packages, to be used across general kernels and especially in internet-off code challenges:
| Package | Version | Functionality |
|---|---|---|
| AutoGluon | 1.0.0 | AutoML models |
| Catboost | 1.2.2 1.2.3 | ML models |
| Iterative-Stratification | 0.1.7 | Iterative stratification for multi-label classifiers |
| Joblib | 1.3.2 | File dumping and retrieval |
| LAMA | 0.3.8b1 | AutoML models |
| LightGBM | 4.3.0 4.2.0 4.1.0 | ML models |
| MAPIE | 0.8.2 | Quantile regression |
| Numpy | 1.26.3 | Data wrangling |
| Pandas | 2.1.4 | Data wrangling |
| Polars | 0.20.3 0.20.4 | Data wrangling |
| PyTorch | 2.0.1 | Neural networks |
| PyTorch-TabNet | 4.1.0 | Neural networks |
| PyTorch-Forecast | 0.7.0 | Neural networks |
| Pygwalker | 0.3.20 | Data wrangling and visualization |
| Scikit-learn | 1.3.2 1.4.0 | ML Models/ Pipelines/ Data wrangling |
| Scipy | 1.11.4 | Data wrangling/ Statistics |
| TabPFN | 10.1.9 | ML models |
| Torch-Frame | 1.7.5 | Neural Networks |
| TorchVision | 0.15.2 | Neural Networks |
| XGBoost | 2.0.2 2.0.1 2.0.3 | ML models |
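In an internet-off kernel, the usual pattern is to attach this dataset and install straight from the wheel; for example (the dataset mount path below is an assumption; adjust it to the actual one):

```python
# Kaggle notebook cell; --no-index keeps pip from reaching the network.
!pip install -q --no-index /kaggle/input/python-wheel-files/xgboost-2.0.3-py3-none-manylinux2014_x86_64.whl

import xgboost as xgb
print(xgb.__version__)  # verify the upgrade took effect
```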
I plan to update this dataset with more libraries and later versions as they get upgraded in due course. I hope these wheel files are useful to one and all.
Best regards and happy learning and coding!
https://creativecommons.org/publicdomain/zero/1.0/
In 2010, Kaggle launched its first competition, which was won by Jure Zbontar using a simple linear model. Since then, a lot has changed: we've seen the rebirth of neural networks, the rise of Python, and the creation of powerful libraries like XGBoost, Keras and TensorFlow.
This data set is a dump of all winners' posts from the Kaggle blog, starting with Jure Zbontar's. It allows us to track trends in the techniques, tools and libraries that win competitions.
This is a simple dump. If there's demand, I can upload more detail (including comments and tags).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset uses the variables listed in Table 1 to train four machine learning models (Linear Regression, Decision Tree, Random Forest, and Extreme Gradient Boosting) to explain the mean annual habitat quality in China from 1990 to 2018. The best-performing model (XGBoost) achieved an R² of 0.8411, a mean absolute error (MAE) of 0.0862, and a root mean square error (RMSE) of 0.1341. All raster data were resampled to a 0.1º spatial resolution using bilinear interpolation and projected to the WGS 1984 World Mercator coordinate system.

The dataset includes the following files:
- A CSV file containing the mean annual values of the dependent variable (habitat quality) and the independent variables across China from 1990 to 2018, based on the data listed in Table 1. (HQ: Habitat Quality; CZ: Climate Zone; FFI: Forest Fragmentation Index; GPP: Gross Primary Productivity; Light: Nighttime Lights; PRE: Mean Annual Precipitation Sum; ASP: Aspect; RAD: Solar Radiation; SLOPE: Slope; TEMP: Mean Annual Temperature; SM: Soil Moisture)
- A Python script used for modeling habitat quality, including mean encoding of the categorical variable climate zone (CZ), multicollinearity testing using Variance Inflation Factor (VIF), and implementation of four machine learning models to predict habitat quality.

Table 1. Variables used in the machine learning models

| Dataset | Units | Source |
|---|---|---|
| Habitat Quality | - | Calculated based on landcover map (Yang & Huang, 2021) |
| Gross Primary Productivity | gC m−2 d−1 | (Wang et al., 2021) |
| Temperature | ºC | (Peng et al., 2019) |
| Precipitation | 0.1 mm | (Peng et al., 2019) |
| Downward shortwave radiation | W m−2 | (He et al., 2020) |
| Soil moisture | m3 m−3 | (K. Zhang et al., 2024) |
| Nighttime light | Digital Number | (L. Zhang et al., 2024) |
| Forest fragmentation index | - | Derived from landcover map (Yang & Huang, 2021) |
| Digital Elevation Model | m | (CGIAR-CSI, 2022) |
| Aspect | Degree | Derived from DEM (CGIAR-CSI, 2022) |
| Slope | Degree | Derived from DEM (CGIAR-CSI, 2022) |
| Climate zones | - | (Kottek et al., 2006) |

References
- CGIAR-CSI. (2022). SRTM DEM dataset in China (2000). National Tibetan Plateau Data Center.
- He, J., Yang, K., Tang, W., Lu, H., Qin, J., Chen, Y., & Li, X. (2020). The first high-resolution meteorological forcing dataset for land process studies over China. Scientific Data, 7(1), 25. https://doi.org/10.1038/s41597-020-0369-y
- Kottek, M., Grieser, J., Beck, C., Rudolf, B., & Rubel, F. (2006). World Map of the Köppen-Geiger climate classification updated. Meteorologische Zeitschrift, 15(3), 259–263. https://doi.org/10.1127/0941-2948/2006/0130
- Peng, S., Ding, Y., Liu, W., & Li, Z. (2019). 1 km monthly temperature and precipitation dataset for China from 1901 to 2017. Earth System Science Data, 11(4), 1931–1946. https://doi.org/10.5194/essd-11-1931-2019
- Wang, S., Zhang, Y., Ju, W., Qiu, B., & Zhang, Z. (2021). Tracking the seasonal and inter-annual variations of global gross primary production during last four decades using satellite near-infrared reflectance data. Science of The Total Environment, 755, 142569. https://doi.org/10.1016/j.scitotenv.2020.142569
- Yang, J., & Huang, X. (2021). The 30 m annual land cover dataset and its dynamics in China from 1990 to 2019. Earth System Science Data, 13(8), 3907–3925. https://doi.org/10.5194/essd-11-1931-2019
- Zhang, K., Chen, H., Ma, N., Shang, S., Wang, Y., Xu, Q., & Zhu, G. (2024). A global dataset of terrestrial evapotranspiration and soil moisture dynamics from 1982 to 2020. Scientific Data, 11(1), 445. https://doi.org/10.1038/s41597-024-03271-7
- Zhang, L., Ren, Z., Chen, B., Gong, P., Xu, B., & Fu, H. (2024). A Prolonged Artificial Nighttime-light Dataset of China (1984–2020). Scientific Data, 11(1), 414. https://doi.org/10.1038/s41597-024-03223-1