Each file corresponds to a training or test set of one of the four experiments described in the article. The columns (variables/features) are the same as the ones in the full dataset.
🚀 Python package-style code for LightGBM and TabNet on this dataset. This is the training and inference code. Notebooks (ipynb) are the norm on Kaggle; I restructured the code as a Python package, which is better suited to training from a shell command.
I based this on the original code below; thanks to @chumajin.
[Notebook] Reference Notebook by chumajin
-- config : YAML file of LightGBM parameters
-- models : saved models
-- train.py
-- predict_test.py
-- feature_engineering.py
-- metric.py
-- preprocessing.py
-- seed.py
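To give a feel for how the package runs from the shell, here is a minimal sketch of what a train.py driven by the YAML config might look like. This is hypothetical: the config keys, file paths, and the "target" column are assumptions, not taken from the actual repository.

```python
# Minimal sketch of a package-style train.py (hypothetical paths/keys).
import yaml
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

def main(config_path="config/lgbm_params.yaml"):
    # Load hyperparameters from the YAML config directory listed above
    with open(config_path) as f:
        params = yaml.safe_load(f)

    # Assumed feature table with a "target" column
    df = pd.read_csv("data/train.csv")
    X, y = df.drop(columns=["target"]), df["target"]
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    model = lgb.train(
        params,
        lgb.Dataset(X_tr, label=y_tr),
        num_boost_round=params.pop("num_boost_round", 1000),
        valid_sets=[lgb.Dataset(X_val, label=y_val)],
        callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)],
    )
    model.save_model("models/lgbm.txt")  # mirrors the models/ directory above

if __name__ == "__main__":
    main()
```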
-- tabnet_preprocessing.py
-- config : tabnet_hyp.yaml / tabnet_config.py
-- models : saved models
-- predict_test.py
-- train.py
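And a corresponding hedged sketch of the TabNet training step, assuming the pytorch-tabnet package; the hyperparameter values, data shapes, and paths are illustrative only.

```python
# Hypothetical sketch of TabNet training with pytorch-tabnet.
import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor

# Placeholder data; TabNetRegressor expects 2-D float targets
X_train = np.random.rand(1000, 16).astype(np.float32)
y_train = np.random.rand(1000, 1).astype(np.float32)
X_valid = np.random.rand(200, 16).astype(np.float32)
y_valid = np.random.rand(200, 1).astype(np.float32)

model = TabNetRegressor(seed=42)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    max_epochs=100,
    patience=20,      # early stopping on the eval set
    batch_size=1024,
)
model.save_model("models/tabnet")  # writes models/tabnet.zip
```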
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
TL;DR: 24 base regression and classification models (GBDT + NN) and their blend.
We trained all models (CatBoost and LGBM for regression; DenseLight and FT-Transformer for both regression and classification) with both the original target and a clipped target (all price values above 500k clipped in the training fold), using both the original dataset and a version augmented with the kagglex data (added only in the train fold; thanks to @lashfire).
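As a rough illustration of the clipping-and-blending idea, here is a short sketch; the helper names are hypothetical, and the 500k threshold is the one mentioned above.

```python
# Illustrative sketch of target clipping and prediction blending.
import numpy as np

def clip_target(y_train, cap=500_000):
    # Clip only the training-fold target; validation targets stay untouched
    # so CV scores remain comparable across clipped and unclipped models.
    return np.minimum(y_train, cap)

def blend(preds, weights):
    # Weighted average of base-model predictions (normalized weights)
    preds = np.column_stack(preds)   # (n_samples, n_models)
    w = np.asarray(weights, dtype=float)
    return preds @ (w / w.sum())
```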
Our final ensemble with 10-Fold CV scores is:
[Image: Final_solution2.png (final ensemble with 10-fold CV scores): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F19099%2Ff15b1940fee181b446a50537a55450ae%2Finbox_597945_ec16e57f8d54df381cbc2ca8fcecb9d1_Final_solution2.png?generation=1725464647329188&alt=media]
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance metrics for nine models in the training dataset.
In this repository you can find a variety of data and scripts to approximate the UTCI in southern South America and apply it to forecasts generated by data-driven models:
1) UTCI data from ERA5-HEAT and different meteorological variables from ERA5.
2) LightGBM models trained to estimate the UTCI from different predictors.
3) Two example scripts to train the LGBM models (a minimal sketch follows below).
4) Scripts for metric estimation on the test sample for different LightGBM-based models with different predictors.
5) Forecasts from the traditional GFS model and from data-driven models during a heat wave in central Argentina in March 2023.
6) Scripts to apply the UTCI approach to the forecasts mentioned in the previous item.
This material is related to the article "Forecasting Heat Stress in southern South America from data-driven model outputs"
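As orientation for item 3 above, here is a minimal sketch of training an LGBM model to approximate the UTCI from ERA5 predictors. The file name and ERA5 variable names (t2m, d2m, u10, v10, ssrd) are assumptions; the repository's own scripts define the actual predictor sets.

```python
# Sketch: LGBM regression of UTCI on assumed ERA5 predictors.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

df = pd.read_csv("era5_utci_samples.csv")           # hypothetical flattened table
predictors = ["t2m", "d2m", "u10", "v10", "ssrd"]   # assumed predictor subset
X_tr, X_te, y_tr, y_te = train_test_split(
    df[predictors], df["utci"], test_size=0.2, random_state=0
)

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_tr, y_tr)
rmse = ((model.predict(X_te) - y_te) ** 2).mean() ** 0.5
print(f"Test RMSE: {rmse:.2f} degC")
```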
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Samples were calculated with a ≈93 ms window and 50% overlap.
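For orientation, a ≈93 ms window with 50% overlap corresponds to the framing below. The 44.1 kHz sampling rate is an assumption (4096 samples / 44100 Hz ≈ 92.9 ms, which matches the ≈93 ms figure); the dataset may use a different rate.

```python
# Sketch of ≈93 ms framing with 50% overlap (sampling rate assumed).
import numpy as np

fs = 44_100            # assumed sampling rate (Hz)
win = 4096             # 4096 / 44100 ≈ 92.9 ms
hop = win // 2         # 50% overlap

def frame(signal, win=win, hop=hop):
    n = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win] for i in range(n)])

frames = frame(np.random.randn(fs))  # 1 s of noise -> 20 frames
print(frames.shape)
```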
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Solar energy generated from photovoltaic panels is an important energy source that brings many benefits to people and the environment. It is a growing trend globally and plays an increasingly important role in the future of the energy industry. However, its intermittent nature and its potential for use in distributed systems require accurate forecasting to balance supply and demand, optimize energy storage, and manage grid stability. In this study, five machine learning models were used: Gradient Boosting Regressor (GB), XGB Regressor (XGBoost), K-Neighbors Regressor (KNN), LGBM Regressor (LightGBM), and CatBoost Regressor (CatBoost). Leveraging a dataset of 21,045 samples, factors such as humidity, ambient temperature, wind speed, visibility, cloud ceiling, and pressure serve as inputs for constructing these machine learning models to forecast solar energy. Model accuracy is assessed and compared using metrics such as the coefficient of determination (R2), root mean square error (RMSE), and mean absolute error (MAE). The results show that the CatBoost model emerges as the frontrunner in predicting solar energy, with training values of R2 = 0.608, RMSE = 4.478 W, and MAE = 3.367 W, and testing values of R2 = 0.46, RMSE = 4.748 W, and MAE = 3.583 W. SHAP analysis reveals that ambient temperature and humidity have the greatest influence on the solar energy generated from photovoltaic panels.
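A hedged sketch of this evaluation setup, using CatBoost and the three reported metrics: the feature names follow the description, but the data here is a random placeholder, not the actual 21,045-sample dataset.

```python
# Sketch: CatBoost regression with R2 / RMSE / MAE reporting.
import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

features = ["Humidity", "AmbientTemp", "WindSpeed", "Visibility", "CloudCeiling", "Pressure"]
X = np.random.rand(21045, len(features))   # placeholder for the 21,045-sample dataset
y = np.random.rand(21045) * 20             # placeholder target (W)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = CatBoostRegressor(verbose=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"R2:   {r2_score(y_te, pred):.3f}")
print(f"RMSE: {mean_squared_error(y_te, pred) ** 0.5:.3f} W")
print(f"MAE:  {mean_absolute_error(y_te, pred):.3f} W")
```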
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: In this paper, we examine whether machine learning and deep learning can be used to predict difficult airway intubation in patients undergoing thyroid surgery.
Methods: We used 10 machine learning and deep learning algorithms to establish corresponding models on a training group, and then verified the results on a test group. We used R for the statistical analysis and constructed the machine learning prediction models in Python.
Results: The top five weighting factors for difficult airways identified by the average algorithm in machine learning were age, sex, weight, height, and BMI. In the training group, the Gradient Boosting AUC, accuracy, and precision were 0.932, 0.929, and 100%, respectively. For predicting difficult airways in the test group, the three algorithms with the highest AUC values among the 10 models were Gradient Boosting, CNN, and LGBM, with values of 0.848, 0.836, and 0.812, respectively. Gradient Boosting also had the highest accuracy (0.913) and the highest precision (100%).
Conclusion: According to our results, Gradient Boosting performed best overall, with an AUC > 0.8, an accuracy > 90%, and a precision of 100%. In addition, the top five weighting factors identified by the average algorithm for difficult airways were age, sex, weight, height, and BMI.
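To make the reported metrics concrete, here is an illustrative sketch (not the authors' code) of fitting a Gradient Boosting classifier on the five top-weighted predictors and computing AUC, accuracy, and precision; the data is a synthetic placeholder.

```python
# Sketch: Gradient Boosting classification with AUC/accuracy/precision.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score
from sklearn.model_selection import train_test_split

# Placeholder data with the five top-weighted factors: age, sex, weight, height, BMI
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = rng.integers(0, 2, 500)  # 1 = difficult airway (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)
print(f"AUC:       {roc_auc_score(y_te, prob):.3f}")
print(f"Accuracy:  {accuracy_score(y_te, pred):.3f}")
print(f"Precision: {precision_score(y_te, pred):.3f}")
```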