15 datasets found
  1. Mau_LightGBM_Height_RFG_Level: Multisource geospatial data integration for building height and level estimation

    • kaggle.com
    zip
    Updated Jul 16, 2025
    Cite
    DataLeMur (2025). Mau_LightGBM_Height_RFG_Level [Dataset]. https://www.kaggle.com/datasets/saqifdtahmid/mau-lightgbm-height-rfg-level
    Explore at:
    zip (60589949 bytes)
    Dataset updated
    Jul 16, 2025
    Authors
    DataLeMur
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    In this project, I focus on enhancing global building data by combining multiple open-source geospatial datasets to predict building attributes, specifically the number of levels (floors). The core datasets used are the Microsoft Open Buildings dataset, which provides detailed building footprints across many regions, and Google’s Temporal Buildings Dataset (V1), which includes estimated building heights over time derived from satellite imagery. While Google's dataset includes height information for many buildings, a significant portion contains missing or unreliable values.

    To address this, I first performed data preprocessing and merged the two datasets based on geographic coordinates. For buildings with missing height values, I used LightGBM, a gradient boosting framework, to impute missing heights using features like footprint area, geometry, and surrounding context. I then brought in OpenStreetMap (OSM) data to enrich the dataset with additional contextual information, such as building type, land use, and nearby infrastructure.

    Using the combined dataset — now with both original and imputed heights — I trained a Random Forest Regressor to predict the number of building levels. Since floor count is not always directly available, especially in developing regions, this approach offers a way to estimate it from height and footprint data with relatively high accuracy.

    This kind of modeling has important real-world applications. Predicting building levels can help support urban planning, disaster response, infrastructure development, and climate risk modeling. For example, knowing the number of floors in buildings allows for better estimation of population density, potential occupancy, or structural vulnerability in earthquake-prone or flood-prone regions. It can also help fill gaps in existing GIS data where traditional surveys are too expensive or time-consuming.

    In future work, this framework could be extended globally and refined with additional data sources like LIDAR or census information to further improve the accuracy and coverage of building-level models.
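    The two-stage workflow described above, gradient-boosted height imputation followed by a Random Forest Regressor for floor counts, can be sketched as follows. This is a minimal illustration on synthetic data: the feature names are invented, and sklearn's GradientBoostingRegressor stands in for LightGBM.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the merged footprint data: area, perimeter,
# neighbour density (illustrative features, not the dataset's columns).
n = 1000
X = rng.uniform([50, 30, 0], [2000, 400, 1], size=(n, 3))
height = 3.0 * (X[:, 0] ** 0.25) + 2.0 * X[:, 2] + rng.normal(0, 1, n)  # metres
levels = np.clip(np.round(height / 3.0), 1, None)                       # ~3 m/floor

# Pretend 40% of heights are missing, as in the height dataset.
missing = rng.random(n) < 0.4

# Stage 1: impute missing heights from footprint features
# (GradientBoostingRegressor stands in for LightGBM here).
imputer = GradientBoostingRegressor(random_state=0)
imputer.fit(X[~missing], height[~missing])
height_full = height.copy()
height_full[missing] = imputer.predict(X[missing])

# Stage 2: predict floor count from footprint features plus (imputed) height.
X2 = np.column_stack([X, height_full])
rf = RandomForestRegressor(random_state=0)
rf.fit(X2, levels)
```

The key design point is that stage 2 never sees missing heights: every row carries either an observed or an imputed value before the level model is trained.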

  2. Table_3_Construction of diagnostic models for the progression of...

    • frontiersin.figshare.com
    xlsx
    Updated May 15, 2024
    + more versions
    Cite
    Xin Jiang; Ruilong Zhou; Fengle Jiang; Yanan Yan; Zheting Zhang; Jianmin Wang (2024). Table_3_Construction of diagnostic models for the progression of hepatocellular carcinoma using machine learning.xlsx [Dataset]. http://doi.org/10.3389/fonc.2024.1401496.s006
    Explore at:
    xlsx
    Dataset updated
    May 15, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Xin Jiang; Ruilong Zhou; Fengle Jiang; Yanan Yan; Zheting Zhang; Jianmin Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Liver cancer is one of the most prevalent forms of cancer worldwide. A significant proportion of patients with hepatocellular carcinoma (HCC) are diagnosed at advanced stages, leading to unfavorable treatment outcomes. Generally, the development of HCC occurs in distinct stages. However, the diagnostic and intervention markers for each stage remain unclear. Therefore, there is an urgent need to explore precise grading methods for HCC. Machine learning has emerged as an effective technique for studying precise tumor diagnosis. In this research, we employed random forest and LightGBM machine learning algorithms for the first time to construct diagnostic models for HCC at various stages of progression. We categorized 118 samples from GSE114564 into three groups: normal liver, precancerous lesion (including chronic hepatitis, liver cirrhosis, and dysplastic nodule), and HCC (including early-stage HCC and advanced HCC). The LightGBM model exhibited outstanding performance (accuracy = 0.96, precision = 0.96, recall = 0.96, F1-score = 0.95). Similarly, the random forest model also demonstrated good performance (accuracy = 0.83, precision = 0.83, recall = 0.83, F1-score = 0.83). When the progression of HCC was categorized into the most refined six stages (normal liver, chronic hepatitis, liver cirrhosis, dysplastic nodule, early-stage HCC, and advanced HCC), the diagnostic model still exhibited high efficacy. Among them, the LightGBM model exhibited good performance (accuracy = 0.71, precision = 0.71, recall = 0.71, F1-score = 0.72). The performance of the LightGBM model was also superior to that of the random forest model. Overall, we have constructed a diagnostic model for the progression of HCC and identified potential diagnostic characteristic genes for its progression.

  3. Tox24 challenge data

    • kaggle.com
    zip
    Updated Sep 18, 2024
    Cite
    Antonina Dolgorukova (2024). Tox24 challenge data [Dataset]. https://www.kaggle.com/datasets/antoninadolgorukova/tox24-challenge-data/suggestions
    Explore at:
    zip (19160575 bytes)
    Dataset updated
    Sep 18, 2024
    Authors
    Antonina Dolgorukova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and associated notebooks were created to solve the Tox24 Challenge and to provide real-world data and a working example of how machine learning can be used to predict binding activity to a target protein such as Transthyretin (TTR), from retrieving and preprocessing SMILES to ensembling the obtained predictions.

    SMILES: The file all_smiles_data.csv contains various SMILES for the 1512 competition chemicals (retrieved from PubChem, cleaned, and with SMILES containing isolated atoms removed), generated in this notebook. Here I also evaluated the performance of XGBoost using molecular descriptors computed from different SMILES representations of the chemicals.

    FEATURES: The 'features' folder contains features calculated with OCHEM and those computed in the feature engineering notebook. All feature sets were evaluated with XGBoost here.

    Feature selection notebooks: - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-for-xgboost - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-xgboost - https://www.kaggle.com/code/antoninadolgorukova/tox24-feature-selection-by-clusters-for-lightgbm

    MODELS: The 'submits' folder contains the predictions for the 500 test chemicals of the Tox24 Challenge, made with the XGBoost and LightGBM models and used for the final ensemble.

    DATA: The TTR Supplemental Tables are taken from the article that accompanied the Tox24 Challenge, and include:

    • Tables outlining the components of the assay reactions and lists of autofluorescent chemicals,
    • chemicals excluded from the analysis due to interference,
    • and chemicals screened in single concentration and concentration response testing.

    This dataset can be used for drug design research, protein-ligand interaction studies, and machine learning model development focused on chemical binding activities.

  4. Falhas em Rebocadores v3.2

    • kaggle.com
    zip
    Updated Sep 24, 2025
    Cite
    Jossian Brito (2025). Falhas em Rebocadores v3.2 [Dataset]. https://www.kaggle.com/datasets/jossianbrito/tug-failures
    Explore at:
    zip (361785 bytes)
    Dataset updated
    Sep 24, 2025
    Authors
    Jossian Brito
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    What is this?

    Tug Failures v3.2 is a synthetic dataset for training and evaluating multiclass classification models of failures aboard harbor tugboats. It was built with realistic physical and operational logic (propulsion, hydraulic, electrical, and fuel systems) and per-port context (Suape, Santos, Ponta da Madeira, Paranaguá).
    Focus: operational safety → prioritize recall (F2) on the highest-risk classes.

    Typical use: ML baseline + per-class thresholds (F2) + contextual rules (e.g., WINCH in Suape/Paranaguá; ELEC in Santos).

    Files

    • tug_failures_dataset_v3_2.csv — 3k rows, 7 failure classes.

    Target

    • failure_flag — 1/0 (failure / no failure)
    • failure_class — one of:
      • AIR_PRESS_LOW, ELEC_BLACKOUT, FUEL_FILTER_CLOG, HEX_CLOG, HYD_PUMP_FAIL, OIL_PRESS_DROP, WINCH_BRAKE_WEAR

    For multiclass classification, filter failure_flag==1.

    Key features (examples)

    • Engines/systems: hyd_pressure_bar, lube_oil_pressure_bar, jacket_temp_c, aftercooler_temp_c, exhaust_temp_c, dp_fuel_kpa, air_bottle_bar, gen_load_pct, crankcase_press_kpa
    • Context: port, tow_mode, maneuver_density, swell_risk, winch_slack_events, maint_support_level, response_time_min
    • Identification/time: tug_id, voyage_date
    • Version/flags: version, etc.

    The distributions and correlations reflect typical operating and failure patterns (e.g., WINCH_BRAKE_WEAR is more likely when tow_mode=1 plus swell/slack in Suape/Paranaguá; ELEC_BLACKOUT under high gen_load_pct in Santos with low support).

    Recommended baseline

    1. Split by group (tug_id) to avoid leakage between tugboats (realistic scenario).
    2. LGBM/GBDT model with:
      • Numeric features: median imputer + standard scaler
      • Categorical features: most_frequent imputer + one-hot (ignore unknown)
    3. Per-class thresholds via F2 (recall-oriented) + contextual rules:
      • WINCH: lower the threshold when port in {Suape,Paranagua} or tow_mode=1 & (swell/slack | hyd_pressure_bar<148 | maneuver_density>0.6).
      • ELEC: favor when gen_load_pct>=88 with a history of alarms and low support/Santos.

    Example (Kaggle Notebook)

    !pip -q install lightgbm
    
    import glob
    CAND = glob.glob("/kaggle/input/**/tug_failures_dataset_v3_2.csv", recursive=True)
    CSV = CAND[0] if CAND else "/kaggle/working/tug_failures_dataset_v3_2.csv"
    print("CSV:", CSV)
    
    # (see the baseline notebook for the full training + thresholds + rules script)
    
  5. Table 1_Machine learning prediction of anxiety symptoms in social anxiety...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Jan 7, 2025
    Cite
    Pack, Seung Pil; Hur, Ji-Won; Jung, Dooyoung; Cho, Chul-Hyun; Park, Jin-Hyun; Lee, Hwamin; Lee, Heon-Jeong; Shin, Yu-Bin (2025). Table 1_Machine learning prediction of anxiety symptoms in social anxiety disorder: utilizing multimodal data from virtual reality sessions.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001283930
    Explore at:
    Dataset updated
    Jan 7, 2025
    Authors
    Pack, Seung Pil; Hur, Ji-Won; Jung, Dooyoung; Cho, Chul-Hyun; Park, Jin-Hyun; Lee, Hwamin; Lee, Heon-Jeong; Shin, Yu-Bin
    Description

    Introduction: Machine learning (ML) is an effective tool for predicting mental states and is a key technology in digital psychiatry. This study aimed to develop ML algorithms to predict the upper tertile group of various anxiety symptoms based on multimodal data from virtual reality (VR) therapy sessions for social anxiety disorder (SAD) patients, and to evaluate their predictive performance across each data type.
    Methods: This study included 32 SAD-diagnosed individuals and finalized a dataset of 132 samples from 25 participants. It utilized multimodal (physiological and acoustic) data from VR sessions to simulate social anxiety scenarios. The study employed the extended Geneva minimalistic acoustic parameter set for acoustic feature extraction and extracted statistical attributes from time series-based physiological responses. We developed ML models that predict the upper tertile group for various anxiety symptoms in SAD using Random Forest, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost) models. The best parameters were explored through grid search or random search, and the models were validated using stratified cross-validation and leave-one-out cross-validation.
    Results: The CatBoost model, using multimodal features, exhibited high performance, particularly for the Social Phobia Scale, with an area under the receiver operating characteristic curve (AUROC) of 0.852. It also showed strong performance in predicting cognitive symptoms, with the highest AUROC of 0.866 for the Post-Event Rumination Scale. For generalized anxiety, the LightGBM prediction for the State-Trait Anxiety Inventory-trait led to an AUROC of 0.819. In the same analysis, models using only physiological features had AUROCs of 0.626, 0.744, and 0.671, whereas models using only acoustic features had AUROCs of 0.788, 0.823, and 0.754.
    Conclusions: This study showed that an ML algorithm using integrated multimodal data can predict upper tertile anxiety symptoms in patients with SAD with higher performance than acoustic or physiological data alone obtained during a VR session. The results can serve as evidence for personalized VR sessions and demonstrate the strength of the clinical use of multimodal data.
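    A minimal sketch of the "upper tertile" framing, binarizing a continuous symptom score at its 2/3 quantile and scoring a classifier by AUROC, is shown below on synthetic data; LogisticRegression stands in for the boosted models used in the study, and the features are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Illustrative stand-in: continuous anxiety-scale scores and some features.
n = 150
X = rng.normal(size=(n, 5))
score = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(0, 1.0, n)

# Binarise into "upper tertile vs rest", as done for each symptom scale.
y = (score >= np.quantile(score, 2 / 3)).astype(int)

clf = LogisticRegression().fit(X, y)
auroc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
```

The tertile cut makes the task an imbalanced binary problem (roughly 1/3 positives), which is why AUROC, a ranking metric insensitive to the base rate, is a natural choice.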

  6. Table_1_Predicting superagers: a machine learning approach utilizing gut...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Sep 9, 2024
    Cite
    Ha Eun Kim; Bori R. Kim; Sang Hi Hong; Seung Yeon Song; Jee Hyang Jeong; Geon Ha Kim (2024). Table_1_Predicting superagers: a machine learning approach utilizing gut microbiome features.DOCX [Dataset]. http://doi.org/10.3389/fnagi.2024.1444998.s001
    Explore at:
    docx
    Dataset updated
    Sep 9, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Ha Eun Kim; Bori R. Kim; Sang Hi Hong; Seung Yeon Song; Jee Hyang Jeong; Geon Ha Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: Cognitive decline is often considered an inevitable aspect of aging; however, recent research has identified a subset of older adults known as “superagers” who maintain cognitive abilities comparable to those of younger individuals. Investigating the neurobiological characteristics associated with superior cognitive function in superagers is essential for understanding “successful aging.” Evidence suggests that the gut microbiome plays a key role in brain function, forming a bidirectional communication network known as the microbiome-gut-brain axis. Alterations in the gut microbiome have been linked to cognitive aging markers such as oxidative stress and inflammation. This study aims to investigate the unique patterns of the gut microbiome in superagers and to develop machine learning-based predictive models to differentiate superagers from typical agers.
    Methods: We recruited 161 cognitively unimpaired, community-dwelling volunteers aged 60 years or older from dementia prevention centers in Seoul, South Korea. After applying inclusion and exclusion criteria, 115 participants were included in the study. Following the removal of microbiome data outliers, 102 participants, comprising 57 superagers and 45 typical agers, were finally analyzed. Superagers were defined based on memory performance at or above average normative values of middle-aged adults. Gut microbiome data were collected from stool samples, and microbial DNA was extracted and sequenced. Relative abundances of bacterial genera were used as features for model development. We employed the LightGBM algorithm to build predictive models and utilized SHAP analysis for feature importance and interpretability.
    Results: The predictive model achieved an AUC of 0.832 and accuracy of 0.764 in the training dataset, and an AUC of 0.861 and accuracy of 0.762 in the test dataset. Significant microbiome features for distinguishing superagers included Alistipes, PAC001137_g, PAC001138_g, Leuconostoc, and PAC001115_g. SHAP analysis revealed that higher abundances of certain genera, such as PAC001138_g and PAC001115_g, positively influenced the likelihood of being classified as superagers.
    Conclusion: Our findings demonstrate that machine learning-based predictive models using gut microbiome features can differentiate superagers from typical agers with reasonable performance.

  7. S1 Data -

    • plos.figshare.com
    csv
    Updated Jan 2, 2025
    Cite
    Huu Nam Nguyen; Quoc Thanh Tran; Canh Tung Ngo; Duc Dam Nguyen; Van Quan Tran (2025). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0315955.s001
    Explore at:
    csv
    Dataset updated
    Jan 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Huu Nam Nguyen; Quoc Thanh Tran; Canh Tung Ngo; Duc Dam Nguyen; Van Quan Tran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Solar energy generated from photovoltaic panels is an important energy source that brings many benefits to people and the environment. It is a growing trend globally and plays an increasingly important role in the future of the energy industry. However, its intermittent nature and potential for distributed use require accurate forecasting to balance supply and demand, optimize energy storage, and manage grid stability. In this study, five machine learning models were used: Gradient Boosting Regressor (GB), XGB Regressor (XGBoost), K-neighbors Regressor (KNN), LGBM Regressor (LightGBM), and CatBoost Regressor (CatBoost). Leveraging a dataset of 21045 samples, factors such as Humidity, Ambient temperature, Wind speed, Visibility, Cloud ceiling, and Pressure serve as inputs for constructing these machine learning models to forecast solar energy. Model accuracy is assessed and compared using metrics such as the coefficient of determination (R2), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). The results show that the CatBoost model emerges as the frontrunner in predicting solar energy, with training values of R2 of 0.608, RMSE of 4.478 W, and MAE of 3.367 W, and testing values of R2 of 0.46, RMSE of 4.748 W, and MAE of 3.583 W. SHAP analysis reveals that ambient temperature and humidity have the greatest influence on the solar energy generated from photovoltaic panels.
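    The evaluation protocol, fitting several regressors and comparing them on R2, RMSE, and MAE, can be sketched as follows on synthetic data. Only GB and KNN stand-ins are shown, and the data are illustrative, not the study's weather-to-power measurements.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the 6-input weather data (illustrative only).
X, y = make_regression(n_samples=1000, n_features=6, n_informative=6,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("GB", GradientBoostingRegressor(random_state=0)),
                    ("KNN", KNeighborsRegressor())]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    r2 = r2_score(y_te, pred)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    mae = mean_absolute_error(y_te, pred)
    print(f"{name}: R2={r2:.3f} RMSE={rmse:.3f} MAE={mae:.3f}")
```

Reporting all three metrics is useful because R2 is scale-free while RMSE and MAE are in the target's units (watts here), and RMSE penalizes large errors more heavily than MAE.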

  8. The statistical values of the variables.

    • plos.figshare.com
    xls
    Updated Jan 2, 2025
    Cite
    Huu Nam Nguyen; Quoc Thanh Tran; Canh Tung Ngo; Duc Dam Nguyen; Van Quan Tran (2025). The statistical values of the variables. [Dataset]. http://doi.org/10.1371/journal.pone.0315955.t001
    Explore at:
    xls
    Dataset updated
    Jan 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Huu Nam Nguyen; Quoc Thanh Tran; Canh Tung Ngo; Duc Dam Nguyen; Van Quan Tran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Solar energy generated from photovoltaic panels is an important energy source that brings many benefits to people and the environment. It is a growing trend globally and plays an increasingly important role in the future of the energy industry. However, its intermittent nature and potential for distributed use require accurate forecasting to balance supply and demand, optimize energy storage, and manage grid stability. In this study, five machine learning models were used: Gradient Boosting Regressor (GB), XGB Regressor (XGBoost), K-neighbors Regressor (KNN), LGBM Regressor (LightGBM), and CatBoost Regressor (CatBoost). Leveraging a dataset of 21045 samples, factors such as Humidity, Ambient temperature, Wind speed, Visibility, Cloud ceiling, and Pressure serve as inputs for constructing these machine learning models to forecast solar energy. Model accuracy is assessed and compared using metrics such as the coefficient of determination (R2), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). The results show that the CatBoost model emerges as the frontrunner in predicting solar energy, with training values of R2 of 0.608, RMSE of 4.478 W, and MAE of 3.367 W, and testing values of R2 of 0.46, RMSE of 4.748 W, and MAE of 3.583 W. SHAP analysis reveals that ambient temperature and humidity have the greatest influence on the solar energy generated from photovoltaic panels.

  9. Data from: Accelerated Design for High-Entropy Alloys Based on Machine...

    • acs.figshare.com
    xlsx
    Updated Sep 26, 2023
    Cite
    Yingying Ma; Minjie Li; Yongkun Mu; Gang Wang; Wencong Lu (2023). Accelerated Design for High-Entropy Alloys Based on Machine Learning and Multiobjective Optimization [Dataset]. http://doi.org/10.1021/acs.jcim.3c00916.s002
    Explore at:
    xlsx
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    ACS Publications
    Authors
    Yingying Ma; Minjie Li; Yongkun Mu; Gang Wang; Wencong Lu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    High-entropy alloys (HEAs) with high hardness and high ductility can be considered as candidates for wear-resistant applications. However, designing novel HEAs with multiple desired properties using traditional alloy design methods remains challenging due to the enormous composition space. In this work, we proposed a machine-learning-based framework to design HEAs with high Vickers hardness (H) and high compressive fracture strain (D). Initially, we constructed data sets containing 172,467 data points with 161 features for D and H, respectively. Four-step feature selection was performed, with 12 and 8 features selected for the D and H prediction models based on the optimal algorithms of support vector regression (SVR) and the light gradient boosting machine (LightGBM), respectively. The R2 of the well-trained models reached 0.76 and 0.90 for 10-fold cross-validation. The nondominated sorting genetic algorithm version II (NSGA-II) and virtual screening were employed to search for the optimal alloy compositions, and four recommended candidates were synthesized to validate our methods. Notably, the D of three candidates showed significant improvements compared to the samples with similar H in the original data sets, with increases of 135.8, 282.4, and 194.1%, respectively. Analyzing the candidates, we have recommended suitable atomic percentage ranges for elements such as Al (2–14.8 at %), Nb (4–25 at %), and Mo (3–9.9 at %) in order to design HEAs with high hardness and ductility.
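    The multiobjective selection step rests on nondominated sorting: a candidate is kept only if no other candidate is at least as good in both hardness H and fracture strain D. A minimal sketch of that filter, with hypothetical (H, D) values rather than real alloy data:

```python
def pareto_front(points):
    """Return the nondominated points when maximising every objective."""
    front = []
    for p in points:
        dominated = any(
            q != p and all(q[i] >= p[i] for i in range(len(p)))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (hardness H, fracture strain D) candidates.
alloys = [(500, 0.30), (650, 0.10), (600, 0.25), (500, 0.20), (450, 0.35)]
print(pareto_front(alloys))
# → [(500, 0.3), (650, 0.1), (600, 0.25), (450, 0.35)]
```

NSGA-II applies this idea repeatedly (peeling off successive fronts) while evolving the candidate pool, but the dominance test above is the core operation.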

  10. Table 4_Interpretable machine learning approach for TBM tunnel crown...

    • frontiersin.figshare.com
    docx
    Updated Jun 30, 2025
    + more versions
    Cite
    Wanrui Hu; Kai Wu; Heng Liu; Weibang Luo; Xingxing Li; Peng Guan (2025). Table 4_Interpretable machine learning approach for TBM tunnel crown convergence prediction with Bayesian optimization.docx [Dataset]. http://doi.org/10.3389/feart.2025.1608468.s004
    Explore at:
    docx
    Dataset updated
    Jun 30, 2025
    Dataset provided by
    Frontiers
    Authors
    Wanrui Hu; Kai Wu; Heng Liu; Weibang Luo; Xingxing Li; Peng Guan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurate prediction of crown convergence in Tunnel Boring Machine (TBM) tunnels is critical for ensuring construction safety, optimizing support design, and improving construction efficiency. This study proposes an interpretable machine learning method based on Bayesian optimization (BO) and SHapley Additive exPlanations (SHAP) for predicting crown convergence (CC) in TBM tunnels. Firstly, a dataset comprising 1,501 samples was constructed using tunnel engineering data. Then, six classical ML models (Support Vector Regression, Decision Tree, Random Forest, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting, and K-nearest neighbors) were developed, and BO was applied to tune the hyperparameters of each model to achieve accurate prediction of CC. Subsequently, the SHAP method was adopted to interpret the LightGBM model, quantifying the contribution of each input feature to the model’s predictions. The results indicate that the LightGBM model achieved the best prediction performance on the test set, with root mean squared error, mean absolute error, mean absolute percentage error, and coefficient of determination values of 0.9122 mm, 0.6027 mm, 0.0644, and 0.9636, respectively. The average SHAP values for the six input features of the LightGBM model were ranked as follows: Time (0.1366) > Rock grade (0.0871) > Depth ratio (0.0528) > Still arch (0.0200) > Saturated compressive strength (0.0093) > Rock quality designation (0.0047). Validation using data from a TBM water conveyance tunnel in Xinjiang, China, confirmed the method’s practical utility, positioning it as an effective auxiliary tool for safer and more efficient TBM tunnel construction.

  11. Some features of the dataset from a bank.

    • plos.figshare.com
    xls
    Updated Mar 6, 2024
    Cite
    HaiChao Du; Li Lv; Hongliang Wang; An Guo (2024). Some features of the dataset from a bank. [Dataset]. http://doi.org/10.1371/journal.pone.0294537.t001
    Explore at:
    xls
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    HaiChao Du; Li Lv; Hongliang Wang; An Guo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Credit card fraud is a significant problem that costs billions of dollars annually. Detecting fraudulent transactions is challenging due to the imbalance in class distribution, where the majority of transactions are legitimate. While pre-processing techniques such as oversampling of minority classes are commonly used to address this issue, they often generate unrealistic or overgeneralized samples. This paper proposes a method called autoencoder with probabilistic XGBoost based on SMOTE and CGAN (AE-XGB-SMOTE-CGAN) for detecting credit card fraud. AE-XGB-SMOTE-CGAN is a novel method proposed for credit card fraud detection problems. The credit card fraud dataset comes from a real dataset anonymized by a bank and is highly imbalanced, with normal data far outnumbering fraud data. An autoencoder (AE) is used to extract relevant features from the dataset, enhancing feature representation learning; the extracted features are then fed into XGBoost for classification according to a threshold. Additionally, in this study, we propose a novel approach that hybridizes a Generative Adversarial Network (GAN) and the Synthetic Minority Over-Sampling Technique (SMOTE) to tackle class imbalance problems. Our two-phase oversampling approach involves knowledge transfer and leverages the synergies of SMOTE and GAN. Specifically, GAN transforms the unrealistic or overgeneralized samples generated by SMOTE into realistic data distributions where there is not enough minority class data available for GAN to process effectively on its own. SMOTE is used to address class imbalance issues, and CGAN is used to generate new, realistic data to supplement the original dataset. The AE-XGB-SMOTE-CGAN algorithm is also compared to other commonly used machine learning algorithms, such as KNN and LightGBM, and shows an overall improvement of 2% in terms of the ACC index compared to these algorithms. The AE-XGB-SMOTE-CGAN algorithm also outperforms KNN in terms of the MCC index by 30% when the threshold is set to 0.35. This indicates that the AE-XGB-SMOTE-CGAN algorithm has higher accuracy, true positive rate, true negative rate, and Matthews correlation coefficient, making it a promising method for detecting credit card fraud.
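    SMOTE's core idea, generating synthetic minority samples by interpolating between a minority sample and one of its near neighbors, can be sketched in a few lines. This is a simplified illustration, not the paper's implementation or the imbalanced-learn library.

```python
import numpy as np

def smote_like(minority, n_new, k=5, seed=0):
    """Generate synthetic minority samples by interpolating towards
    randomly chosen near neighbours (the core idea behind SMOTE)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # distances from sample i to all others; pick one of its k nearest
        d = np.linalg.norm(minority - minority[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]
        j = rng.choice(nbrs)
        gap = rng.random()  # random point on the segment between i and j
        out.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(out)

minority = np.random.default_rng(1).normal(size=(20, 3))  # e.g. 20 fraud rows
synthetic = smote_like(minority, n_new=50)
```

Because each synthetic point is a convex combination of two real minority samples, it stays inside the minority region; the paper's CGAN stage then reshapes such samples towards a more realistic distribution.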

  12. Performance of AE-XGB-SMOTE-CGAN with and without data augmentation.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Mar 6, 2024
    Cite
    HaiChao Du; Li Lv; Hongliang Wang; An Guo (2024). Performance of AE-XGB-SMOTE-CGAN with and without data augmentation. [Dataset]. http://doi.org/10.1371/journal.pone.0294537.t003
    Explore at:
    xls
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    HaiChao Du; Li Lv; Hongliang Wang; An Guo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of AE-XGB-SMOTE-CGAN with and without data augmentation.

  13. Performance comparisons of AE-XGB-SMOTE-CGAN and related methods.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Mar 6, 2024
    Cite
    HaiChao Du; Li Lv; Hongliang Wang; An Guo (2024). Performance comparisons of AE-XGB-SMOTE-CGAN and related methods. [Dataset]. http://doi.org/10.1371/journal.pone.0294537.t004
    Explore at:
    xls
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    HaiChao Du; Li Lv; Hongliang Wang; An Guo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparisons of AE-XGB-SMOTE-CGAN and related methods.

  14. The summary table of performance metrics for the five algorithms.

    • plos.figshare.com
    xls
    Updated Jan 2, 2025
    Cite
    Huu Nam Nguyen; Quoc Thanh Tran; Canh Tung Ngo; Duc Dam Nguyen; Van Quan Tran (2025). The summary table of performance metrics for the five algorithms. [Dataset]. http://doi.org/10.1371/journal.pone.0315955.t004
    Explore at:
    xls (available download formats)
    Dataset updated
    Jan 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Huu Nam Nguyen; Quoc Thanh Tran; Canh Tung Ngo; Duc Dam Nguyen; Van Quan Tran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The summary table of performance metrics for the five algorithms.

  15. Descriptive summary of final dataset with p-values.

    • plos.figshare.com
    xls
    Updated Nov 6, 2025
    Cite
    Md Ahiduzzaman; Md Nahid Hasan (2025). Descriptive summary of final dataset with p-values. [Dataset]. http://doi.org/10.1371/journal.pone.0335915.t002
    Explore at:
    xls (available download formats)
    Dataset updated
    Nov 6, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Md Ahiduzzaman; Md Nahid Hasan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Descriptive summary of final dataset with p-values.


Cite
DataLeMur (2025). Mau_LightGBM_Height_RFG_Level [Dataset]. https://www.kaggle.com/datasets/saqifdtahmid/mau-lightgbm-height-rfg-level

Mau_LightGBM_Height_RFG_Level

Multisource geospatial data integration for building height and level estimation

Explore at:
zip (60589949 bytes), available download formats
Dataset updated
Jul 16, 2025
Authors
DataLeMur
License

CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

Description

In this project, I focus on enhancing global building data by combining multiple open-source geospatial datasets to predict building attributes, specifically the number of levels (floors). The core datasets used are the Microsoft Open Buildings dataset, which provides detailed building footprints across many regions, and Google’s Temporal Buildings Dataset (V1), which includes estimated building heights over time derived from satellite imagery. While Google's dataset includes height information for many buildings, a significant portion contains missing or unreliable values.

To address this, I first performed data preprocessing and merged the two datasets based on geographic coordinates. For buildings with missing height values, I used LightGBM, a gradient boosting framework, to impute missing heights using features like footprint area, geometry, and surrounding context. I then brought in OpenStreetMap (OSM) data to enrich the dataset with additional contextual information, such as building type, land use, and nearby infrastructure.

Using the combined dataset — now with both original and imputed heights — I trained a Random Forest Regressor to predict the number of building levels. Since floor count is not always directly available, especially in developing regions, this approach offers a way to estimate it from height and footprint data with relatively high accuracy.
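
The level-prediction step might look like the sketch below, again on synthetic data. The "one level per roughly 3 m of height" rule used to generate the toy labels is an illustrative assumption, not a figure from the dataset; the key detail is rounding the regressor's continuous output to a positive integer floor count.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic buildings; the height-to-levels relationship is assumed.
rng = np.random.default_rng(1)
n = 600
height_m = rng.uniform(3, 60, n)
footprint_area = rng.uniform(40, 2000, n)
levels = np.maximum(1, np.round(height_m / 3.0 + rng.normal(0, 0.3, n)))

X = np.column_stack([height_m, footprint_area])
X_tr, X_te, y_tr, y_te = train_test_split(X, levels, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)

# Floor counts are positive integers, so round and clip the predictions.
pred = np.maximum(1, np.round(rf.predict(X_te)))
mae = np.abs(pred - y_te).mean()
```

Treating levels as a regression target and rounding afterwards keeps ordinal structure (a 10-storey error is worse than a 1-storey error), which a plain multiclass classifier would ignore.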

This kind of modeling has important real-world applications. Predicting building levels can help support urban planning, disaster response, infrastructure development, and climate risk modeling. For example, knowing the number of floors in buildings allows for better estimation of population density, potential occupancy, or structural vulnerability in earthquake-prone or flood-prone regions. It can also help fill gaps in existing GIS data where traditional surveys are too expensive or time-consuming.
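
As a concrete instance of the occupancy use case, predicted levels and footprint area give a rough floor-area-based headcount. The 30 m² of floor area per resident below is a hypothetical planning figure chosen purely for illustration:

```python
def estimate_occupancy(levels: int, footprint_area_m2: float,
                       m2_per_person: float = 30.0) -> int:
    """Rough occupancy estimate: total floor area / area per resident.

    The default m2_per_person is an illustrative assumption, not a value
    taken from any of the datasets above.
    """
    usable_floor_area = levels * footprint_area_m2
    return round(usable_floor_area / m2_per_person)

print(estimate_occupancy(5, 300.0))  # 5 floors x 300 m^2 / 30 m^2 -> 50
```

Even this crude calculation shows why floor counts matter: two buildings with identical footprints but different heights can imply very different exposed populations in a flood or earthquake scenario.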

In future work, this framework could be extended globally and refined with additional data sources, such as LiDAR or census information, to further improve the accuracy and coverage of building-level models.
