3 datasets found
  1. Lung cancer Bangladesh

    • kaggle.com
    Updated Mar 15, 2025
    NISHAT VASKER (2025). Lung cancer Bangladesh [Dataset]. http://doi.org/10.34740/kaggle/dsv/11035259
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 15, 2025
    Dataset provided by
    Kaggle
    Authors
    NISHAT VASKER
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Bangladesh
    Description

    About Dataset

    Overview
    This dataset was synthesized to support research in lung cancer survival prediction, enabling the development of models that estimate:

    • Whether a patient is likely to survive at least one year post-diagnosis (Binary Classification).
    • The probability of survival based on clinical and lifestyle factors (Regression Analysis).

    The dataset is designed for machine learning and deep learning applications in medical AI, oncology research, and predictive healthcare.

    Dataset Generation Process
    The dataset was generated using a combination of real-world epidemiological insights, medical literature, and statistical modeling. The feature distributions and relationships were modeled to reflect real-world clinical scenarios, with the aim of biomedical validity.

    Medical References & Sources
    The dataset structure is based on well-established lung cancer risk factors and survival indicators documented in leading medical research and clinical guidelines:

    • World Health Organization (WHO) reports on lung cancer epidemiology.
    • National Cancer Institute (NCI) & American Cancer Society (ACS) guidelines on lung cancer risk factors and treatment outcomes.
    • The IASLC Lung Cancer Staging Project (8th Edition): standard reference for lung cancer staging.
    • Harrison’s Principles of Internal Medicine (20th Edition): provides an in-depth review of lung cancer diagnosis and treatment.
    • Lung Cancer: Principles and Practice (2022, Oxford University Press): clinical insights into lung cancer detection, treatment, and survival factors.

    Features of the Dataset
    Each record in the dataset represents an individual’s clinical condition, lifestyle risk factors, and survival outcome. The dataset includes the following features:

    1. Patient Demographics
    • Age → A key risk factor for lung cancer progression and survival.
    • Gender → Male and female lung cancer survival rates can differ.
    • Residence → Urban vs. Rural (impact of environmental factors).

    2. Risk Factors & Lifestyle Indicators
    These factors have been linked to lung cancer risk in epidemiological studies:
    • Smoking Status → (Current Smoker, Former Smoker, Never Smoked).
    • Air Pollution Exposure → (Low, Moderate, High).
    • Biomass Fuel Use → (Yes/No) – Associated with household air pollution.
    • Factory Exposure → (Yes/No) – Industrial exposure increases lung cancer risk.
    • Family History → (Yes/No) – Genetic predisposition to lung cancer.
    • Diet Habit → (Vegetarian, Non-Vegetarian, Mixed) – Nutritional impact on cancer progression.

    3. Symptoms (Primary Predictors)
    These are key clinical indicators associated with lung cancer detection and severity:
    • Hemoptysis (Coughing Blood)
    • Chest Pain
    • Fatigue & Weakness
    • Chronic Cough
    • Unexplained Weight Loss

    4. Tumor Characteristics & Clinical Features
    • Tumor Size (mm) → The size of the detected tumor.
    • Histology Type → (Adenocarcinoma, Squamous Cell Carcinoma, Small Cell Carcinoma).
    • Cancer Stage → (Stage I to Stage IV).

    5. Treatment & Healthcare Facility
    • Treatment Received → (Surgery, Chemotherapy, Radiation, Targeted Therapy).
    • Hospital Type → (Private, Government, Medical College).

    6. Target Variables (Predicted Outcomes)
    • Survival (Binary) → 1 (Yes) if the patient survives at least 1 year, 0 (No) otherwise.
    • Survival Probability (%) (can be derived) → Estimated probability of survival within one year.

    Why Is This Dataset Valuable?

    Balanced Data Distribution
    • Designed to ensure a representative distribution of lung cancer survival cases.
    • Intended to reduce model bias and improve generalization in predictive models.

    Medically-Inspired Feature Engineering
    • Features are derived from real-world lung cancer risk factors, validated through medical literature.
    • Incorporates both lifestyle and clinical indicators to enhance predictive accuracy. (No real patient data is used; the records come from a simulated biomedical environment.)

    Diverse Risk Factors Considered
    • Smoking, air pollution, and genetic history as primary lung cancer contributors.
    • Symptom severity and tumor histology influence survival rates.

    Scalability & ML Suitability
    • Suited to classification and regression tasks in machine learning.
    • Can be used with deep learning (TensorFlow, PyTorch), ML models (XGBoost, Random Forest, SVM), and explainable AI techniques like SHAP and LIME.

    Dataset Usage & Applications
    This dataset is intended for multiple healthcare AI applications, including:

    • Predictive Analytics → Early detection of high-risk lung cancer patients.
    • Healthcare Chatbots → AI-powered risk assessment tools.
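The categorical features listed above would need numeric encoding before most of the named models can use them. The following is a minimal sketch of one-hot encoding for four of those features. The column names in `SCHEMA` are hypothetical (the listing does not show the actual CSV headers), the intermediate stage labels "Stage II" and "Stage III" are assumed from the stated range "Stage I to Stage IV", and the example record is invented for illustration.

```python
# Category levels copied from the dataset description. The dictionary keys are
# hypothetical column names; the real file headers are not shown in the listing.
SCHEMA = {
    "smoking_status": ["Current Smoker", "Former Smoker", "Never Smoked"],
    "air_pollution_exposure": ["Low", "Moderate", "High"],
    "histology_type": [
        "Adenocarcinoma",
        "Squamous Cell Carcinoma",
        "Small Cell Carcinoma",
    ],
    # "Stage II" and "Stage III" are assumed from the range "Stage I to Stage IV".
    "cancer_stage": ["Stage I", "Stage II", "Stage III", "Stage IV"],
}


def one_hot(record):
    """Return a flat 0/1 list with one slot per (column, level) pair, in SCHEMA order."""
    vector = []
    for column, levels in SCHEMA.items():
        value = record[column]
        if value not in levels:
            raise ValueError(f"unexpected {column} value: {value!r}")
        vector.extend(1 if value == level else 0 for level in levels)
    return vector


# An invented record, used only to show the shape of the output.
example = {
    "smoking_status": "Former Smoker",
    "air_pollution_exposure": "High",
    "histology_type": "Adenocarcinoma",
    "cancer_stage": "Stage III",
}
encoded = one_hot(example)
```

Each record becomes a 13-element vector (3 + 3 + 3 + 4 levels) with exactly one 1 per feature. Ordered features such as air pollution exposure and cancer stage could instead be mapped to integers; one-hot encoding is used here only because it makes no assumption about spacing between levels.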

  2. Data from: Melanoma Proteomics Unveiled: Harmonizing Diverse Data Sets for...

    • acs.figshare.com
    xlsx
    Updated May 6, 2025
    Áron Bartha; Boglárka Weltz; Lazaro Hiram Betancourt; Jeovanis Gil; Natália Pinto de Almeida; Giampaolo Bianchini; Beáta Szeitz; Leticia Szadai; Indira Pla; Lajos V. Kemény; Ágnes Judit Jánosi; Runyu Hong; Ahmad Rajeh; Fábio Nogueira; Viktória Doma; Nicole Woldmar; Jéssica Guedes; Zsuzsanna Újfaludi; Yonghyo Kim; Tibor Szarvas; Zoltan Pahi; Tibor Pankotai; A. Marcell Szasz; Aniel Sanchez; Bo Baldetorp; József Tímár; István Balázs Németh; Sarolta Kárpáti; Roger Appelqvist; Gilberto Barbosa Domont; Krzysztof Pawlowski; Elisabet Wieslander; Johan Malm; David Fenyo; Peter Horvatovich; György Marko-Varga; Balázs Győrffy (2025). Melanoma Proteomics Unveiled: Harmonizing Diverse Data Sets for Biomarker Discovery and Clinical Insights via MEL-PLOT [Dataset]. http://doi.org/10.1021/acs.jproteome.4c00749.s002
    Available download formats: xlsx
    Dataset updated
    May 6, 2025
    Dataset provided by
    ACS Publications
    Authors
    Áron Bartha; Boglárka Weltz; Lazaro Hiram Betancourt; Jeovanis Gil; Natália Pinto de Almeida; Giampaolo Bianchini; Beáta Szeitz; Leticia Szadai; Indira Pla; Lajos V. Kemény; Ágnes Judit Jánosi; Runyu Hong; Ahmad Rajeh; Fábio Nogueira; Viktória Doma; Nicole Woldmar; Jéssica Guedes; Zsuzsanna Újfaludi; Yonghyo Kim; Tibor Szarvas; Zoltan Pahi; Tibor Pankotai; A. Marcell Szasz; Aniel Sanchez; Bo Baldetorp; József Tímár; István Balázs Németh; Sarolta Kárpáti; Roger Appelqvist; Gilberto Barbosa Domont; Krzysztof Pawlowski; Elisabet Wieslander; Johan Malm; David Fenyo; Peter Horvatovich; György Marko-Varga; Balázs Győrffy
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Using several melanoma proteomics data sets we created a single analysis platform that enables the discovery, knowledge build, and validation of diagnostic, predictive, and prognostic biomarkers at the protein level. Quantitative mass-spectrometry-based proteomic data was obtained from five independent cohorts, including 489 tissue samples from 394 patients with accompanying clinical metadata. We established an interactive R-based web platform that enables the comparison of protein levels across diverse cohorts, and supports correlation analysis between proteins and clinical metadata including survival outcomes. By comparing differential protein levels between metastatic, primary tumor, and nonmalignant samples in two of the cohorts, we identified 274 proteins showing significant differences among the sample types. Further analysis of these 274 proteins in lymph node metastatic samples from a third cohort revealed that 45 proteins exhibited a significant effect on patient survival. The three most significant proteins were HP (HR = 4.67, p = 2.8e-06), LGALS7 (HR = 3.83, p = 2.9e-05), and UBQLN1 (HR = 3.2, p = 4.8e-05). The user-friendly interactive web platform, accessible at https://www.tnmplot.com/melanoma, provides an interactive interface for the analysis of proteomic and clinical data. The MEL-PLOT platform, through its interactive capabilities, streamlines the creation of a comprehensive knowledge base, empowering hypothesis formulation and diligent monitoring of the most recent advancements in the domains of biomedical research and drug development.
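As a reading aid for the hazard ratios quoted above: the abstract does not name the survival model, so the following assumes the values come from a Cox proportional hazards fit with a single binary covariate $x$ (for example, a high versus low protein-level group, which is also an assumption).

```latex
h(t \mid x) = h_0(t)\, e^{\beta x},
\qquad
\mathrm{HR} = \frac{h(t \mid x = 1)}{h(t \mid x = 0)} = e^{\beta}
```

Under that assumption, the reported HR = 4.67 for HP corresponds to $\beta = \ln 4.67 \approx 1.54$, meaning the hazard in the $x = 1$ group is about 4.67 times that of the $x = 0$ group at any time $t$. Which group is coded as $x = 1$ is not stated in the abstract.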

  3. Data_Sheet_1_Causal discovery in high-dimensional, multicollinear...

    • frontiersin.figshare.com
    pdf
    Updated Jun 16, 2023
    Minxue Jia; Daniel Y. Yuan; Tyler C. Lovelace; Mengying Hu; Panayiotis V. Benos (2023). Data_Sheet_1_Causal discovery in high-dimensional, multicollinear datasets.PDF [Dataset]. http://doi.org/10.3389/fepid.2022.899655.s001
    Available download formats: pdf
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    Frontiers
    Authors
    Minxue Jia; Daniel Y. Yuan; Tyler C. Lovelace; Mengying Hu; Panayiotis V. Benos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As the cost of high-throughput genomic sequencing technology declines, its application in clinical research becomes increasingly popular. The collected datasets often contain tens or hundreds of thousands of biological features that need to be mined to extract meaningful information. One area of particular interest is discovering underlying causal mechanisms of disease outcomes. Over the past few decades, causal discovery algorithms have been developed and expanded to infer such relationships. However, these algorithms suffer from the curse of dimensionality and multicollinearity. A recently introduced, non-orthogonal, general empirical Bayes approach to matrix factorization has been demonstrated to successfully infer latent factors with interpretable structures from observed variables. We hypothesize that applying this strategy to causal discovery algorithms can solve both the high dimensionality and collinearity problems, inherent to most biomedical datasets. We evaluate this strategy on simulated data and apply it to two real-world datasets. In a breast cancer dataset, we identified important survival-associated latent factors and biologically meaningful enriched pathways within factors related to important clinical features. In a SARS-CoV-2 dataset, we were able to predict whether a patient (1) had COVID-19 and (2) would enter the ICU. Furthermore, we were able to associate factors with known COVID-19 related biological pathways.


Lung cancer Bangladesh

Synthetically generated dataset on lung cancer among people in Bangladesh.

Explore at:
17 scholarly articles cite this dataset (View in Google Scholar)
