CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
In this project, I focus on enhancing global building data by combining multiple open-source geospatial datasets to predict building attributes, specifically the number of levels (floors). The core datasets used are the Microsoft Open Buildings dataset, which provides detailed building footprints across many regions, and Google’s Temporal Buildings Dataset (V1), which includes estimated building heights over time derived from satellite imagery. While Google's dataset includes height information for many buildings, a significant portion contains missing or unreliable values.
To address this, I first performed data preprocessing and merged the two datasets based on geographic coordinates. For buildings with missing height values, I used LightGBM, a gradient boosting framework, to impute missing heights using features like footprint area, geometry, and surrounding context. I then brought in OpenStreetMap (OSM) data to enrich the dataset with additional contextual information, such as building type, land use, and nearby infrastructure.
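As a rough illustration of this imputation step, the sketch below trains a LightGBM regressor on buildings with known heights and predicts the missing ones. The file and column names (merged_buildings.csv, footprint_area_m2, neighbor_mean_height, height_m) are hypothetical placeholders, not the project's actual schema.

import lightgbm as lgb
import pandas as pd

# Hypothetical merged table of Microsoft footprints and Google temporal heights.
buildings = pd.read_csv("merged_buildings.csv")
features = ["footprint_area_m2", "perimeter_m", "compactness", "neighbor_mean_height"]

known = buildings[buildings["height_m"].notna()]
missing = buildings[buildings["height_m"].isna()]

# Train on buildings with a reliable height, then fill the gaps with predictions.
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(known[features], known["height_m"])
buildings.loc[missing.index, "height_m"] = model.predict(missing[features])
buildings["height_imputed"] = buildings.index.isin(missing.index)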
Using the combined dataset — now with both original and imputed heights — I trained a Random Forest Regressor to predict the number of building levels. Since floor count is not always directly available, especially in developing regions, this approach offers a way to estimate it from height and footprint data with relatively high accuracy.
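Continuing the hypothetical table from the sketch above, a minimal version of the level-prediction step could look like the following; the levels column (e.g. from OSM building:levels tags) and the simple train/test split are illustrative assumptions rather than the project's exact setup.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Rows where a ground-truth floor count is available.
labeled = buildings[buildings["levels"].notna()]
X = labeled[["height_m", "footprint_area_m2", "compactness", "height_imputed"]]
y = labeled["levels"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=300, random_state=42).fit(X_train, y_train)
print("MAE (floors):", mean_absolute_error(y_test, rf.predict(X_test)))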
This kind of modeling has important real-world applications. Predicting building levels can help support urban planning, disaster response, infrastructure development, and climate risk modeling. For example, knowing the number of floors in buildings allows for better estimation of population density, potential occupancy, or structural vulnerability in earthquake-prone or flood-prone regions. It can also help fill gaps in existing GIS data where traditional surveys are too expensive or time-consuming.
In future work, this framework could be extended globally and refined with additional data sources like LIDAR or census information to further improve the accuracy and coverage of building-level models.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Liver cancer is one of the most prevalent forms of cancer worldwide. A significant proportion of patients with hepatocellular carcinoma (HCC) are diagnosed at advanced stages, leading to unfavorable treatment outcomes. Generally, the development of HCC occurs in distinct stages. However, the diagnostic and intervention markers for each stage remain unclear. Therefore, there is an urgent need to explore precise grading methods for HCC. Machine learning has emerged as an effective technique for studying precise tumor diagnosis. In this research, we employed the random forest and LightGBM machine learning algorithms for the first time to construct diagnostic models for HCC at various stages of progression. We categorized 118 samples from GSE114564 into three groups: normal liver, precancerous lesion (including chronic hepatitis, liver cirrhosis, and dysplastic nodule), and HCC (including early stage HCC and advanced HCC). The LightGBM model exhibited outstanding performance (accuracy = 0.96, precision = 0.96, recall = 0.96, F1-score = 0.95). Similarly, the random forest model also demonstrated good performance (accuracy = 0.83, precision = 0.83, recall = 0.83, F1-score = 0.83). When the progression of HCC was categorized into the most refined six stages (normal liver, chronic hepatitis, liver cirrhosis, dysplastic nodule, early stage HCC, and advanced HCC), the diagnostic models still exhibited high efficacy. Among them, the LightGBM model exhibited good performance (accuracy = 0.71, precision = 0.71, recall = 0.71, F1-score = 0.72). Again, the performance of the LightGBM model was superior to that of the random forest model. Overall, we have constructed diagnostic models for the progression of HCC and identified potential diagnostic characteristic genes for its progression.
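As a hedged sketch of how such a multiclass diagnostic model can be set up, the snippet below assumes a samples-by-genes expression table with a three-class group label; the file name and preprocessing are placeholders, not the study's actual pipeline.

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assumed layout: rows = samples, columns = gene expression values plus a "group" label
# in {"normal liver", "precancerous lesion", "HCC"}.
data = pd.read_csv("gse114564_expression_with_labels.csv", index_col=0)
X, y = data.drop(columns="group"), data["group"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
clf = lgb.LGBMClassifier(objective="multiclass", n_estimators=300)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))  # accuracy, precision, recall, F1 per class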
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset and associated notebooks were created to solve the Tox24 Challenge and provide real-world data and a working example of how machine learning can be used to predict binding activity to a target protein like Transthyretin (TTR), from retrieving and preprocessing SMILES to ensembling the obtained predictions.
SMILES: The file all_smiles_data.csv contains various SMILES for the 1512 competition chemicals (retrieved from PubChem, cleaned, and with SMILES containing isolated atoms removed), generated in this notebook. Also, here I evaluated the performance of XGBoost using molecular descriptors computed from different SMILES representations of the chemicals.
FEATURES: The 'features' folder contains features calculated with OCHEM and those computed in the feature engineering notebook. All feature sets were evaluated with XGBoost here.
Feature selection notebooks:
- https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-for-xgboost
- https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-xgboost
- https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-lightgbm
MODELS: The 'submits' folder contains the predictions for the 500 test chemicals of the Tox24 Challenge, made with the XGBoost and LightGBM models and used for the final ensemble.
DATA: The TTR Supplemental Tables are taken from the article that accompanied the Tox24 Challenge, and include:
This dataset can be used for drug design research, protein-ligand interaction studies, and machine learning model development focused on chemical binding activities.
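As a rough, self-contained example of the descriptor-plus-XGBoost workflow described above, the sketch below computes a handful of RDKit descriptors from SMILES and cross-validates an XGBoost regressor; the column names ("smiles", "activity") are assumptions rather than the exact schema of all_smiles_data.csv.

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("all_smiles_data.csv")  # assumed to hold a "smiles" column and an "activity" target

def descriptor_row(smiles):
    # A small, illustrative descriptor set; the full feature sets use many more descriptors.
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol),
            Descriptors.NumHDonors(mol), Descriptors.NumHAcceptors(mol)]

X = pd.DataFrame([descriptor_row(s) for s in df["smiles"]],
                 columns=["MolWt", "MolLogP", "TPSA", "HBD", "HBA"])
scores = cross_val_score(XGBRegressor(n_estimators=400, learning_rate=0.05),
                         X, df["activity"], cv=5, scoring="r2")
print("CV R2:", scores.mean())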
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tug Failures v3.2 is a synthetic dataset for training and evaluating multiclass classification models of on-board failures in harbor tugboats. It was built with realistic physical and operational logic (propulsion, hydraulic, electrical, and fuel systems) and per-port context (Suape, Santos, Ponta da Madeira, Paranaguá).
Focus: operational safety → prioritize recall (F2) for the highest-risk classes.
Typical use: an ML baseline + per-class thresholds (F2) + contextual rules (e.g., WINCH in Suape/Paranaguá; ELEC in Santos).
tug_failures_dataset_v3_2.csv: 3k rows, 7 failure classes.
failure_flag: 1/0 (failure / no failure)
failure_class: one of AIR_PRESS_LOW, ELEC_BLACKOUT, FUEL_FILTER_CLOG, HEX_CLOG, HYD_PUMP_FAIL, OIL_PRESS_DROP, WINCH_BRAKE_WEAR
For multiclass classification, filter failure_flag==1.
Sensor features: hyd_pressure_bar, lube_oil_pressure_bar, jacket_temp_c, aftercooler_temp_c, exhaust_temp_c, dp_fuel_kpa, air_bottle_bar, gen_load_pct, crankcase_press_kpa.
Context features: port, tow_mode, maneuver_density, swell_risk, winch_slack_events, maint_support_level, response_time_min.
Identifiers/metadata: tug_id, voyage_date, version, etc.
The distributions and correlations reflect typical operating and failure patterns (e.g., WINCH_BRAKE_WEAR is more likely when tow_mode=1 with swell/slack in Suape/Paranaguá; ELEC_BLACKOUT under high gen_load_pct in Santos with low support).
Split by group (tug_id) to avoid leakage between tugboats (a realistic scenario).
Suggested preprocessing: median imputer + standard scaler for numeric features; most_frequent imputer + one-hot encoding (ignore unknown) for categorical features.
Contextual rules: WINCH when port in {Suape, Paranagua} or tow_mode=1 & (swell/slack | hyd_pressure_bar<148 | maneuver_density>0.6); ELEC when gen_load_pct>=88 with alarm history and low support/Santos.
!pip -q install lightgbm
import glob

# Locate the dataset CSV under /kaggle/input; fall back to the working directory.
CAND = glob.glob("/kaggle/input/**/tug_failures_dataset_v3_2.csv", recursive=True)
CSV = CAND[0] if CAND else "/kaggle/working/tug_failures_dataset_v3_2.csv"
print("CSV:", CSV)
# (see the baseline notebook for the full training + thresholds + rules script)
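A minimal multiclass baseline following the recipe above might look like the sketch below; it reuses the CSV path found above, the feature subsets are abbreviated for illustration, and the baseline notebook remains the reference for thresholds and rules.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GroupShuffleSplit
from lightgbm import LGBMClassifier

df = pd.read_csv(CSV)
df = df[df["failure_flag"] == 1]  # multiclass setting: failure rows only
num_cols = ["hyd_pressure_bar", "gen_load_pct", "dp_fuel_kpa"]  # abbreviated subset
cat_cols = ["port", "tow_mode"]

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")), ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
model = Pipeline([("pre", pre), ("clf", LGBMClassifier(objective="multiclass"))])

# Group split by tug_id to avoid leakage between tugboats.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["tug_id"]))
model.fit(df.iloc[train_idx][num_cols + cat_cols], df.iloc[train_idx]["failure_class"])
print("accuracy:", model.score(df.iloc[test_idx][num_cols + cat_cols], df.iloc[test_idx]["failure_class"]))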
Introduction: Machine learning (ML) is an effective tool for predicting mental states and is a key technology in digital psychiatry. This study aimed to develop ML algorithms to predict the upper tertile group of various anxiety symptoms based on multimodal data from virtual reality (VR) therapy sessions for social anxiety disorder (SAD) patients and to evaluate their predictive performance across each data type.
Methods: This study included 32 SAD-diagnosed individuals and finalized a dataset of 132 samples from 25 participants. It utilized multimodal (physiological and acoustic) data from VR sessions designed to simulate social anxiety scenarios. This study employed the extended Geneva minimalistic acoustic parameter set for acoustic feature extraction and extracted statistical attributes from time series-based physiological responses. We developed ML models that predict the upper tertile group for various anxiety symptoms in SAD using Random Forest, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost) models. The best parameters were explored through grid search or random search, and the models were validated using stratified cross-validation and leave-one-out cross-validation.
Results: The CatBoost model, using multimodal features, exhibited high performance, particularly for the Social Phobia Scale, with an area under the receiver operating characteristic curve (AUROC) of 0.852. It also showed strong performance in predicting cognitive symptoms, with the highest AUROC of 0.866 for the Post-Event Rumination Scale. For generalized anxiety, the LightGBM prediction for the State-Trait Anxiety Inventory-trait led to an AUROC of 0.819. In the same analysis, models using only physiological features had AUROCs of 0.626, 0.744, and 0.671, whereas models using only acoustic features had AUROCs of 0.788, 0.823, and 0.754.
Conclusions: This study showed that an ML algorithm using integrated multimodal data can predict upper tertile anxiety symptoms in patients with SAD with higher performance than acoustic or physiological data obtained during a VR session alone. The results of this study can be used as evidence for personalized VR sessions and demonstrate the strength of the clinical use of multimodal data.
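For readers who want a concrete starting point, the snippet below sketches the stratified cross-validated AUROC evaluation described in the Methods, using LightGBM on stand-in data; the real eGeMAPS acoustic and physiological features, the hyperparameter search, and the CatBoost variant are not reproduced here.

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(132, 40))             # stand-in for multimodal (acoustic + physiological) features
y = (rng.random(132) < 1 / 3).astype(int)  # stand-in for the upper-tertile label

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auroc = cross_val_score(LGBMClassifier(n_estimators=200), X, y, cv=cv, scoring="roc_auc")
print("mean AUROC:", auroc.mean())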
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: Cognitive decline is often considered an inevitable aspect of aging; however, recent research has identified a subset of older adults known as “superagers” who maintain cognitive abilities comparable to those of younger individuals. Investigating the neurobiological characteristics associated with superior cognitive function in superagers is essential for understanding “successful aging.” Evidence suggests that the gut microbiome plays a key role in brain function, forming a bidirectional communication network known as the microbiome-gut-brain axis. Alterations in the gut microbiome have been linked to cognitive aging markers such as oxidative stress and inflammation. This study aims to investigate the unique patterns of the gut microbiome in superagers and to develop machine learning-based predictive models to differentiate superagers from typical agers.
Methods: We recruited 161 cognitively unimpaired, community-dwelling volunteers aged 60 years or older from dementia prevention centers in Seoul, South Korea. After applying inclusion and exclusion criteria, 115 participants were included in the study. Following the removal of microbiome data outliers, 102 participants, comprising 57 superagers and 45 typical agers, were finally analyzed. Superagers were defined based on memory performance at or above average normative values of middle-aged adults. Gut microbiome data were collected from stool samples, and microbial DNA was extracted and sequenced. Relative abundances of bacterial genera were used as features for model development. We employed the LightGBM algorithm to build predictive models and utilized SHAP analysis for feature importance and interpretability.
Results: The predictive model achieved an AUC of 0.832 and accuracy of 0.764 in the training dataset, and an AUC of 0.861 and accuracy of 0.762 in the test dataset. Significant microbiome features for distinguishing superagers included Alistipes, PAC001137_g, PAC001138_g, Leuconostoc, and PAC001115_g. SHAP analysis revealed that higher abundances of certain genera, such as PAC001138_g and PAC001115_g, positively influenced the likelihood of being classified as superagers.
Conclusion: Our findings demonstrate that machine learning-based predictive models using gut microbiome features can differentiate superagers from typical agers with reasonable performance.
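A hedged sketch of the LightGBM-plus-SHAP workflow described above is given below; the input files and column names are hypothetical stand-ins for the genus-level relative abundance table and the superager labels.

import pandas as pd
import shap
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

abund = pd.read_csv("genus_relative_abundance.csv", index_col=0)  # samples x genera (assumed layout)
labels = pd.read_csv("labels.csv", index_col=0)["superager"]      # 1 = superager, 0 = typical ager

X_tr, X_te, y_tr, y_te = train_test_split(abund, labels, stratify=labels, random_state=0)
clf = LGBMClassifier(n_estimators=300).fit(X_tr, y_tr)

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_te)
shap.summary_plot(shap_values, X_te)  # genus-level feature importance and direction of effect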
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Solar energy generated from photovoltaic panels is an important energy source that brings many benefits to people and the environment. It is a growing trend globally and plays an increasingly important role in the future of the energy industry. However, its intermittent nature and potential for distributed system use require accurate forecasting to balance supply and demand, optimize energy storage, and manage grid stability. In this study, five machine learning models were used: Gradient Boosting Regressor (GB), XGB Regressor (XGBoost), K-neighbors Regressor (KNN), LGBM Regressor (LightGBM), and CatBoost Regressor (CatBoost). Leveraging a dataset of 21,045 samples, factors such as humidity, ambient temperature, wind speed, visibility, cloud ceiling, and pressure serve as inputs for constructing these machine learning models to forecast solar energy. Model accuracy is assessed and compared using metrics such as the coefficient of determination (R2), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). The results show that the CatBoost model emerges as the frontrunner in predicting solar energy, with training values of R2 of 0.608, RMSE of 4.478 W, and MAE of 3.367 W, and testing values of R2 of 0.46, RMSE of 4.748 W, and MAE of 3.583 W. SHAP analysis reveals that ambient temperature and humidity have the greatest influence on the solar energy generated from photovoltaic panels.
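As an illustrative sketch of the CatBoost setup with the listed weather inputs, the snippet below uses hypothetical file and column names and the standard R2/RMSE/MAE metrics; it is not the authors' exact configuration.

import pandas as pd
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

data = pd.read_csv("solar_weather.csv")  # assumed table of weather inputs and generated power
inputs = ["humidity", "ambient_temperature", "wind_speed", "visibility", "cloud_ceiling", "pressure"]
X_tr, X_te, y_tr, y_te = train_test_split(data[inputs], data["power_w"], test_size=0.2, random_state=42)

model = CatBoostRegressor(iterations=500, verbose=False).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("R2:", r2_score(y_te, pred),
      "RMSE:", mean_squared_error(y_te, pred) ** 0.5,
      "MAE:", mean_absolute_error(y_te, pred))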
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
High-entropy alloys (HEAs) with high hardness and high ductility can be considered as candidates for wear-resistant applications. However, designing novel HEAs with multiple desired properties using traditional alloy design methods remains challenging due to the enormous composition space. In this work, we proposed a machine-learning-based framework to design HEAs with high Vickers hardness (H) and high compressive fracture strain (D). Initially, we constructed data sets containing 172,467 data points with 161 features for D and H, respectively. Four-step feature selection was performed, with 12 and 8 features selected for the D and H prediction models based on the optimal algorithms of support vector regression (SVR) and the light gradient boosting machine (LightGBM), respectively. The R2 of the well-trained models reached 0.76 and 0.90 for the 10-fold cross validation. The nondominated sorting genetic algorithm version II (NSGA-II) and virtual screening were employed to search for optimal alloying compositions, and four recommended candidates were synthesized to validate our methods. Notably, the D of three candidates showed significant improvements compared to the samples with similar H in the original data sets, with increases of 135.8%, 282.4%, and 194.1%, respectively. Analyzing the candidates, we recommended suitable atomic percentage ranges for elements such as Al (2–14.8 at %), Nb (4–25 at %), and Mo (3–9.9 at %) in order to design HEAs with high hardness and ductility.
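A simplified sketch of the two property models with 10-fold cross-validation is given below; the feature tables, target column names, and hyperparameters are placeholders, and the NSGA-II search and virtual screening are omitted.

import pandas as pd
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMRegressor

ductility = pd.read_csv("hea_ductility_features.csv")  # assumed: 12 selected features + target "D"
hardness = pd.read_csv("hea_hardness_features.csv")    # assumed: 8 selected features + target "H"

r2_D = cross_val_score(SVR(C=10, gamma="scale"), ductility.drop(columns="D"), ductility["D"],
                       cv=10, scoring="r2").mean()
r2_H = cross_val_score(LGBMRegressor(n_estimators=400), hardness.drop(columns="H"), hardness["H"],
                       cv=10, scoring="r2").mean()
print("10-fold CV R2  D:", r2_D, " H:", r2_H)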
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate prediction of crown convergence in Tunnel Boring Machine (TBM) tunnels is critical for ensuring construction safety, optimizing support design, and improving construction efficiency. This study proposes an interpretable machine learning method based on Bayesian optimization (BO) and SHapley Additive exPlanations (SHAP) for predicting crown convergence (CC) in TBM tunnels. Firstly, a dataset comprising 1,501 samples was constructed using tunnel engineering data. Then, six classical ML models (Support Vector Regression, Decision Tree, Random Forest, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting, and K-nearest neighbors) were developed, and BO was applied to tune the hyperparameters of each model to achieve accurate prediction of CC. Subsequently, the SHAP method was adopted to interpret the LightGBM model, quantifying the contribution of each input feature to the model’s predictions. The results indicate that the LightGBM model achieved the best prediction performance on the test set, with root mean squared error, mean absolute error, mean absolute percentage error, and coefficient of determination values of 0.9122 mm, 0.6027 mm, 0.0644, and 0.9636, respectively. The average SHAP values for the six input features of the LightGBM model were ranked as follows: Time (0.1366) > Rock grade (0.0871) > Depth ratio (0.0528) > Steel arch (0.0200) > Saturated compressive strength (0.0093) > Rock quality designation (0.0047). Validation using data from a TBM water conveyance tunnel in Xinjiang, China, confirmed the method’s practical utility, positioning it as an effective auxiliary tool for safer and more efficient TBM tunnel construction.
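The snippet below sketches the BO-tuned LightGBM plus SHAP workflow, using Optuna's TPE sampler as one possible Bayesian-optimization implementation; the dataset file, column names, and search space are assumptions, and the study's actual BO setup may differ.

import optuna
import shap
import pandas as pd
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("tbm_cc_dataset.csv")  # assumed: six input features + crown convergence target "CC"
X, y = df.drop(columns="CC"), df["CC"]

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
    }
    return cross_val_score(LGBMRegressor(**params), X, y, cv=5,
                           scoring="neg_root_mean_squared_error").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

best = LGBMRegressor(**study.best_params).fit(X, y)
shap_values = shap.TreeExplainer(best).shap_values(X)  # per-feature contributions to predicted CC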
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Credit card fraud is a significant problem that costs billions of dollars annually. Detecting fraudulent transactions is challenging due to the imbalance in class distribution, where the majority of transactions are legitimate. While pre-processing techniques such as oversampling of minority classes are commonly used to address this issue, they often generate unrealistic or overgeneralized samples. This paper proposes a method called autoencoder with probabilistic XGBoost based on SMOTE and CGAN (AE-XGB-SMOTE-CGAN) for detecting credit card fraud. AE-XGB-SMOTE-CGAN is a novel method proposed for credit card fraud detection problems. The credit card fraud dataset comes from a real dataset anonymized by a bank and is highly imbalanced, with normal data far outnumbering fraud data. An autoencoder (AE) is used to extract relevant features from the dataset, enhancing feature representation learning, and the extracted features are then fed into XGBoost for classification according to a threshold. Additionally, in this study, we propose a novel approach that hybridizes the Generative Adversarial Network (GAN) and the Synthetic Minority Over-Sampling Technique (SMOTE) to tackle class imbalance problems. Our two-phase oversampling approach involves knowledge transfer and leverages the synergies of SMOTE and GAN. Specifically, the GAN transforms the unrealistic or overgeneralized samples generated by SMOTE into realistic data distributions in cases where there is not enough minority-class data for the GAN to process effectively on its own. SMOTE is used to address class imbalance issues, and the CGAN is used to generate new, realistic data to supplement the original dataset. The AE-XGB-SMOTE-CGAN algorithm is also compared to other commonly used machine learning algorithms, such as KNN and LightGBM, and shows an overall improvement of 2% in terms of the ACC index compared to these algorithms. The AE-XGB-SMOTE-CGAN algorithm also outperforms KNN in terms of the MCC index by 30% when the threshold is set to 0.35. This indicates that the AE-XGB-SMOTE-CGAN algorithm has higher accuracy, true positive rate, true negative rate, and Matthews correlation coefficient, making it a promising method for detecting credit card fraud.
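As a partial, hedged sketch of the class-imbalance handling, the snippet below combines SMOTE oversampling with a thresholded XGBoost classifier; the autoencoder and CGAN stages are omitted, and the file and column names are placeholders rather than the paper's actual data.

import pandas as pd
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef

df = pd.read_csv("transactions.csv")  # assumed: anonymized features plus a binary "fraud" label
X_tr, X_te, y_tr, y_te = train_test_split(df.drop(columns="fraud"), df["fraud"],
                                          stratify=df["fraud"], random_state=0)

X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # rebalance the minority class
clf = XGBClassifier(n_estimators=400).fit(X_res, y_res)

proba = clf.predict_proba(X_te)[:, 1]
pred = (proba >= 0.35).astype(int)  # explicit decision threshold, as discussed in the study
print("MCC:", matthews_corrcoef(y_te, pred))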
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance of AE-XGB-SMOTE-CGAN with and without data augmentation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance comparisons of AE-XGB-SMOTE-CGAN and related methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The summary table of performance metrics for the five algorithms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Descriptive summary of final dataset with p-values.