Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
What is the Lung Cancer Dataset?
An effective cancer prediction system helps people learn their cancer risk at low cost and make appropriate decisions based on that risk. The data were collected from an online lung cancer prediction system website.
Acknowledgments
When we use this dataset in our research, we credit the authors as follows:
License: CC BY 4.0.
Hong, Z.Q. and Yang, J.Y., "Optimal Discriminant Plane for a Small Number of Samples and Design Method of Classifier on the Plane", Pattern Recognition, Vol. 24, No. 4, pp. 317-324, 1991. It is published for reuse in the Google research dataset.
The main reason for uploading this dataset is to practice data analysis with my students: I work at a college and want my students to apply what we study to a large dataset. It may not be up to date, and I mention the collection years, but it is a good resource for practice.
https://dataintelo.com/privacy-and-policy
The lung cancer diagnostic tests market size was valued at USD 2.5 billion in 2023 and is projected to reach USD 6.1 billion by 2032, growing at a Compound Annual Growth Rate (CAGR) of 10.5% during the forecast period. This substantial growth can be attributed to the rising prevalence of lung cancer globally, advancements in diagnostic technologies, and increasing awareness regarding early detection and treatment of lung cancer. The growing aging population and the high incidence of smoking, which is a leading cause of lung cancer, further propel the demand for diagnostic tests.
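As a quick arithmetic check of the figures quoted above, the following minimal sketch computes the compound annual growth rate implied by the two market values and, conversely, the 2032 value implied by the stated rate. It assumes nine compounding periods (2023 as the base year, 2032 as the end year); the report does not state its exact convention, and the function names are illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding `rate` once per year for `years` years."""
    return start_value * (1 + rate) ** years


# Figures quoted in the text: USD 2.5 billion (2023), USD 6.1 billion (2032), 10.5% CAGR.
# Assumption: 2032 - 2023 = 9 compounding periods.
periods = 2032 - 2023
implied_rate = cagr(2.5, 6.1, periods)
projected_2032 = project(2.5, 0.105, periods)

print(f"Implied CAGR: {implied_rate:.1%}")
print(f"Projected 2032 value at 10.5%: USD {projected_2032:.2f} billion")
```

Under this assumption the implied rate comes out slightly below the quoted 10.5%, and compounding 10.5% for nine years lands close to USD 6.1 billion, so the quoted figures are mutually consistent up to rounding.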
The increasing prevalence of lung cancer is one of the primary drivers of market growth. Lung cancer remains the leading cause of cancer-related deaths worldwide, necessitating the development of more accurate and early diagnostic methods. With advancements in medical technology, such as molecular diagnostics and non-invasive imaging techniques, the accuracy and efficiency of lung cancer diagnosis have significantly improved. These innovations not only enhance the detection rate but also facilitate personalized treatment plans, thereby improving patient outcomes.
Furthermore, government initiatives and funding for cancer research play a crucial role in market expansion. Many countries are investing heavily in cancer research, leading to the development of new diagnostic tools and techniques. For instance, organizations such as the National Cancer Institute (NCI) in the United States provide substantial grants for lung cancer research, fostering innovations in diagnostics. In addition, public awareness campaigns and screening programs conducted by healthcare organizations and governments encourage early diagnosis, which is vital for successful treatment and survival rates.
The integration of artificial intelligence (AI) and machine learning in diagnostic tools is another significant factor contributing to market growth. AI algorithms can analyze medical images with high precision, aiding radiologists in identifying lung cancer at earlier stages. Moreover, AI-driven software can evaluate large datasets from genetic and molecular tests, providing insights into the most effective treatment options based on individual patient profiles. This technological advancement not only enhances the accuracy of diagnostics but also reduces the time required for analysis, thereby increasing the efficiency of healthcare services.
The EGFR Mutation Test is a pivotal advancement in the realm of lung cancer diagnostics, offering a more personalized approach to treatment. This test specifically identifies mutations in the Epidermal Growth Factor Receptor (EGFR) gene, which are often present in non-small cell lung cancer (NSCLC) patients. By detecting these mutations, healthcare providers can tailor therapies that target the specific genetic alterations, thereby improving treatment efficacy and patient outcomes. The growing adoption of EGFR Mutation Tests underscores the shift towards precision medicine, where treatments are increasingly customized based on individual genetic profiles. This approach not only enhances the effectiveness of therapies but also minimizes adverse effects, as treatments are more accurately aligned with the patient's unique genetic makeup.
Regionally, North America holds the largest share of the lung cancer diagnostic tests market, followed by Europe and Asia Pacific. The dominance of North America can be attributed to the presence of advanced healthcare infrastructure, high healthcare expenditure, and a robust research landscape. The Asia Pacific region, however, is expected to witness the highest growth rate during the forecast period, driven by increasing healthcare investments, growing awareness about lung cancer, and rising incidences of the disease in countries like China and India. The growing middle-class population and improving healthcare access in these countries further support market growth.
The lung cancer diagnostic tests market is segmented by test type into imaging tests, sputum cytology, tissue biopsy, molecular tests, and others. Imaging tests are one of the most commonly used diagnostic methods for lung cancer detection. Techniques such as X-rays, CT scans, and PET scans provide detailed visuals of the lungs, helping in identifying abnormal growths or tumors. The non-invasive nature of these tests and their ability to provide quick results make them a preferred choice among healthcare
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Objective: To identify the socioepidemiologic and histopathologic patterns of lung cancer patients in the Middle Euphrates region.
Patients and Methods: This study analyzed medical information from lung cancer patients at the Middle Euphrates Cancer Center in Iraq from January 2018 to December 2023. Demographic information (age, gender, residency, and education level) and clinical details (histopathological categorization) were obtained. All confirmed lung cancer cases were included, while cases with inadequate data or a non-lung cancer diagnosis were excluded. The data were analyzed using IBM SPSS Statistics (version 26). The data were summarized using descriptive statistics, and chi-square tests were used to identify associations between categorical variables at a significance level of p < 0.05. Ethical approval was obtained from the relevant institutional review board.
Results: A total of 1162 patients were included, with a mean age at diagnosis of 64.47 ± 11.45 years. The majority of patients were over 60 years (64.4%), followed by those aged 40–60 years (34%); the least affected group was under 40 years (1.6%). Males accounted for the majority of cases (68%) and females for about 32%, with a male:female ratio fluctuating around 2:1. Illiterate patients and those with low education levels represented the largest proportion, about 87.9% of the study population. Squamous Cell Carcinoma (SCC) was the most frequent subtype (41.7%), followed closely by Adenocarcinoma (AC) at 37% and Small Cell Lung Cancer (SCLC) at 10.5%. Although SCC is the predominant subtype overall, AC incidence is increasing over time (from 31.7% in 2018 to 41.4% in 2023), with predominance in females and in younger and more highly educated groups. While the percentage of SCLC and other less common subgroups remained relatively stable over time, there was a significant reduction in NSCLC-NOS diagnoses (from 11.1% in 2018 to 3.2% in 2023).
Conclusions: In Iraq, specifically in the Middle Euphrates region, lung cancer is a major public health issue in the older age groups. The two main subtypes, SCC and AC, are the main contributors, with a clear increase in AC cases in recent years. The shifting trends indicate the urgent need for improved screening strategies, focused preventive initiatives, and customized treatment plans in view of changing risk profiles.
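The abstract mentions chi-square tests between categorical variables at p < 0.05. The sketch below illustrates a Pearson chi-square test on a 2×2 contingency table using only the standard library. It is not the authors' SPSS analysis: the counts are hypothetical, no continuity correction is applied, and the p-value formula used here is valid only for one degree of freedom (a 2×2 table).

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column must have a non-zero total")
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts (NOT from the study): rows = male/female, columns = SCC/AC.
hypothetical = [[300, 200], [100, 150]]
stat, p_value = chi_square_2x2(hypothetical)
print(f"chi-square = {stat:.3f}, p = {p_value:.2e}, significant at 0.05: {p_value < 0.05}")
```

For larger tables the same expected-count logic applies, but the p-value needs the chi-square distribution with (rows - 1) × (columns - 1) degrees of freedom, which a statistics library would normally provide.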
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Characteristic | Value (N = 26254) |
---|---|
Age (years) | Mean ± SD: 61.4 ± 5; Median (IQR): 60 (57-65); Range: 43-75 |
Sex | Male: 15512 (59%); Female: 10742 (41%) |
Race | White: 23969 (91.3%) |
Ethnicity | Not Available |
Background: The aggressive and heterogeneous nature of lung cancer has thwarted efforts to reduce mortality from this cancer through the use of screening. The advent of low-dose helical computed tomography (CT) altered the landscape of lung-cancer screening, with studies indicating that low-dose CT detects many tumors at early stages. The National Lung Screening Trial (NLST) was conducted to determine whether screening with low-dose CT could reduce mortality from lung cancer.
Methods: From August 2002 through April 2004, we enrolled 53,454 persons at high risk for lung cancer at 33 U.S. medical centers. Participants were randomly assigned to undergo three annual screenings with either low-dose CT (26,722 participants) or single-view posteroanterior chest radiography (26,732). Data were collected on cases of lung cancer and deaths from lung cancer that occurred through December 31, 2009. This dataset includes the low-dose CT scans from 26,254 of these subjects, as well as digitized histopathology images from 451 subjects.
Results: The rate of adherence to screening was more than 90%. The rate of positive screening tests was 24.2% with low-dose CT and 6.9% with radiography over all three rounds. A total of 96.4% of the positive screening results in the low-dose CT group and 94.5% in the radiography group were false positive results. The incidence of lung cancer was 645 cases per 100,000 person-years (1060 cancers) in the low-dose CT group, as compared with 572 cases per 100,000 person-years (941 cancers) in the radiography group (rate ratio, 1.13; 95% confidence interval [CI], 1.03 to 1.23). There were 247 deaths from lung cancer per 100,000 person-years in the low-dose CT group and 309 deaths per 100,000 person-years in the radiography group, representing a relative reduction in mortality from lung cancer with low-dose CT screening of 20.0% (95% CI, 6.8 to 26.7; P=0.004). The rate of death from any cause was reduced in the low-dose CT group, as compared with the radiography group, by 6.7% (95% CI, 1.2 to 13.6; P=0.02).
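The relative measures reported above can be reproduced approximately from the rounded rates in the text. This is a minimal sketch with illustrative function names; because it starts from rounded per-100,000 rates, the results may differ slightly from the published values, which were presumably computed from unrounded trial data.

```python
def relative_reduction(control_rate: float, intervention_rate: float) -> float:
    """Relative reduction of the intervention rate versus the control rate (fraction)."""
    return (control_rate - intervention_rate) / control_rate


def rate_ratio(rate_a: float, rate_b: float) -> float:
    """Ratio of two event rates expressed in the same units."""
    return rate_a / rate_b


# Lung cancer deaths per 100,000 person-years, as quoted in the text.
mortality_reduction = relative_reduction(control_rate=309, intervention_rate=247)

# Lung cancer incidence per 100,000 person-years, as quoted in the text.
incidence_ratio = rate_ratio(645, 572)

print(f"Relative mortality reduction: {mortality_reduction:.1%}")
print(f"Incidence rate ratio (CT vs radiography): {incidence_ratio:.2f}")
```

The mortality figure lands within rounding distance of the reported 20.0%, and the incidence ratio rounds to the reported 1.13.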
Conclusions: Screening with the use of low-dose CT reduces mortality from lung cancer. (Funded by the National Cancer Institute; National Lung Screening Trial ClinicalTrials.gov number, NCT00047385).
Data Availability: A summary of the National Lung Screening Trial and its available datasets is provided on the Cancer Data Access System (CDAS). CDAS is maintained by Information Management Services (IMS), contracted by the National Cancer Institute (NCI) as keepers and statistical analyzers of the NLST trial data. The full clinical data set from NLST is available through CDAS. Users of TCIA can download without restriction a publicly distributable subset of that clinical data, along with the CT and histopathology images collected during the trial. (These were previously restricted.)
https://creativecommons.org/publicdomain/zero/1.0/
About Dataset

📌 Overview
This dataset has been carefully synthesized to support research in lung cancer survival prediction, enabling the development of models that estimate:
- Whether a patient is likely to survive at least one year post-diagnosis (Binary Classification).
- The probability of survival based on clinical and lifestyle factors (Regression Analysis).
The dataset is designed for machine learning and deep learning applications in medical AI, oncology research, and predictive healthcare.

📜 Dataset Generation Process
The dataset was generated using a combination of real-world epidemiological insights, medical literature, and statistical modeling. The feature distributions and relationships have been modeled to reflect real-world clinical scenarios, with the aim of biomedical validity.

📖 Medical References & Sources
The dataset structure is based on well-established lung cancer risk factors and survival indicators documented in leading medical research and clinical guidelines:
- World Health Organization (WHO) reports on lung cancer epidemiology.
- National Cancer Institute (NCI) & American Cancer Society (ACS) guidelines on lung cancer risk factors and treatment outcomes.
- The IASLC Lung Cancer Staging Project (8th Edition): Standard reference for lung cancer staging.
- Harrison’s Principles of Internal Medicine (20th Edition): Provides an in-depth review of lung cancer diagnosis and treatment.
- Lung Cancer: Principles and Practice (2022, Oxford University Press): Clinical insights into lung cancer detection, treatment, and survival factors.

🔬 Features of the Dataset
Each record in the dataset represents an individual’s clinical condition, lifestyle risk factors, and survival outcome. The dataset includes the following features:

1️⃣ Patient Demographics
- Age → A key risk factor for lung cancer progression and survival.
- Gender → Male and female lung cancer survival rates can differ.
- Residence → Urban vs. Rural (impact of environmental factors).

2️⃣ Risk Factors & Lifestyle Indicators
These factors have been linked to lung cancer risk in epidemiological studies:
- Smoking Status → (Current Smoker, Former Smoker, Never Smoked).
- Air Pollution Exposure → (Low, Moderate, High).
- Biomass Fuel Use → (Yes/No) – Associated with household air pollution.
- Factory Exposure → (Yes/No) – Industrial exposure increases lung cancer risk.
- Family History → (Yes/No) – Genetic predisposition to lung cancer.
- Diet Habit → (Vegetarian, Non-Vegetarian, Mixed) – Nutritional impact on cancer progression.

3️⃣ Symptoms (Primary Predictors)
These are key clinical indicators associated with lung cancer detection and severity:
- Hemoptysis (Coughing Blood)
- Chest Pain
- Fatigue & Weakness
- Chronic Cough
- Unexplained Weight Loss

4️⃣ Tumor Characteristics & Clinical Features
- Tumor Size (mm) → The size of the detected tumor.
- Histology Type → (Adenocarcinoma, Squamous Cell Carcinoma, Small Cell Carcinoma).
- Cancer Stage → (Stage I to Stage IV).

5️⃣ Treatment & Healthcare Facility
- Treatment Received → (Surgery, Chemotherapy, Radiation, Targeted Therapy).
- Hospital Type → (Private, Government, Medical College).

6️⃣ Target Variables (Predicted Outcomes)
- Survival (Binary) → 1 (Yes) if the patient survives at least 1 year, 0 (No) otherwise.
- Survival Probability (%) (Can be derived) → Estimated probability of survival within one year.

⚡ Why Is This Dataset Valuable?
✅ Balanced Data Distribution
- Designed to ensure a representative distribution of lung cancer survival cases.
- Helps prevent model bias and improve generalization in predictive models.
✅ Medically-Inspired Feature Engineering
- Features are derived from real-world lung cancer risk factors, validated through medical literature.
- Incorporates both lifestyle and clinical indicators to enhance predictive accuracy. (No real patient data is used; the records only simulate a biomedical setting.)
✅ Diverse Risk Factors Considered
- Smoking, air pollution, and genetic history as primary lung cancer contributors.
- Symptom severity and tumor histology influence survival rates.
✅ Scalability & ML Suitability
- Ideal for classification and regression tasks in machine learning.
- Can be used with deep learning (TensorFlow, PyTorch), ML models (XGBoost, Random Forest, SVM), and explainable AI techniques like SHAP and LIME.

📂 Dataset Usage & Applications
This dataset is highly useful for multiple healthcare AI applications, including:
- 🩺 Predictive Analytics → Early detection of high-risk lung cancer patients.
- 🤖 Healthcare Chatbots → AI-powered risk assessment tools.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains Cancer Incidence data for Lung Cancer (All Stages^) including: Age-Adjusted Rate, Confidence Interval, Average Annual Count, and Trend field information for US States for the average 5 year span from 2016 to 2020.
Data are segmented by sex (Both Sexes, Male, and Female) and age (All Ages, Ages Under 50, Ages 50 & Over, Ages Under 65, and Ages 65 & Over), with field names and aliases describing the sex and age group tabulated.
For more information, visit statecancerprofiles.cancer.gov

Data Notations
State Cancer Registries may provide more current or more local data.

Trend
- Rising when 95% confidence interval of average annual percent change is above 0.
- Stable when 95% confidence interval of average annual percent change includes 0.
- Falling when 95% confidence interval of average annual percent change is below 0.

† Incidence rates (cases per 100,000 population per year) are age-adjusted to the 2000 US standard population (19 age groups: <1, 1-4, 5-9, ... , 80-84, 85+). Rates are for invasive cancer only (except for bladder cancer which is invasive and in situ) or unless otherwise specified. Rates calculated using SEER*Stat. Population counts for denominators are based on Census populations as modified by NCI. The US Population Data File is used for SEER and NPCR incidence rates.
‡ Incidence Trend data come from different sources. Due to different years of data availability, most of the trends are AAPCs based on APCs but some are APCs calculated in SEER*Stat. Please refer to the source for each area for additional information.
Rates and trends are computed using different standards for malignancy. For more information see malignant.
^ All Stages refers to any stage in the Surveillance, Epidemiology, and End Results (SEER) summary stage.

Data Source Field Key
(1) Source: National Program of Cancer Registries and Surveillance, Epidemiology, and End Results SEER*Stat Database - United States Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute. Based on the 2022 submission.
(5) Source: National Program of Cancer Registries and Surveillance, Epidemiology, and End Results SEER*Stat Database - United States Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute. Based on the 2022 submission.
(6) Source: National Program of Cancer Registries SEER*Stat Database - United States Department of Health and Human Services, Centers for Disease Control and Prevention (based on the 2022 submission).
(7) Source: SEER November 2022 submission.
(8) Source: Incidence data provided by the SEER Program. AAPCs are calculated by the Joinpoint Regression Program and are based on APCs. Data are age-adjusted to the 2000 US standard population (19 age groups: <1, 1-4, 5-9, ... , 80-84, 85+). Rates are for invasive cancer only (except for bladder cancer which is invasive and in situ) or unless otherwise specified. Population counts for denominators are based on Census populations as modified by NCI. The US Population Data File is used with SEER November 2022 data.

Some data are not available, see Data Not Available for combinations of geography, cancer site, age, and race/ethnicity.
Data for the United States does not include data from Nevada.
Data for the United States does not include Puerto Rico.
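The Rising/Stable/Falling trend rule described in the data notations above can be expressed as a small function. This is a minimal sketch: the function name and the example interval values are illustrative, and treating an interval whose bound is exactly 0 as "Stable" is an assumption about how "includes 0" is applied.

```python
def classify_trend(ci_lower: float, ci_upper: float) -> str:
    """Classify a trend from the 95% CI of the average annual percent change (AAPC)."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if ci_lower > 0:
        return "Rising"   # whole interval above 0
    if ci_upper < 0:
        return "Falling"  # whole interval below 0
    return "Stable"       # interval includes 0 (assumption: bounds equal to 0 count as included)


# Illustrative intervals, not taken from the dataset.
for lower, upper in [(0.4, 1.8), (-2.1, -0.3), (-0.5, 0.7)]:
    print(f"AAPC 95% CI ({lower}, {upper}) -> {classify_trend(lower, upper)}")
```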
Rate: Number of deaths due to cancer of the trachea, bronchus, and lung per 100,000 Population.
Definition: Number of deaths per 100,000 with malignant neoplasm (cancer) of the trachea, bronchus, and lung as the underlying cause (ICD-10 codes: C33-C34).
Data Sources:
(1) Centers for Disease Control and Prevention, National Center for Health Statistics. Compressed Mortality File. CDC WONDER On-line Database accessed at http://wonder.cdc.gov/cmf-icd10.html
(2) Death Certificate Database, Office of Vital Statistics and Registry, New Jersey Department of Health
(3) Population Estimates, State Data Center, New Jersey Department of Labor and Workforce Development
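The rate defined above is a count of deaths scaled to a population of 100,000. The sketch below shows that calculation as a crude (unadjusted) rate; the function name and the example numbers are hypothetical and are not values from the New Jersey data.

```python
def rate_per_100k(deaths: int, population: int) -> float:
    """Crude death rate: deaths per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths * 100_000 / population


# Hypothetical county: 450 lung cancer deaths in a population of 900,000.
example_rate = rate_per_100k(deaths=450, population=900_000)
print(f"{example_rate:.1f} deaths per 100,000 population")
```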
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lung cancer is the number one cancer-related cause of death in Sweden and worldwide. In most countries, five-year survival estimates vary between 10% and 20% with evidence of improved survival over time. Over the last decades, the management of lung cancer has changed including the introduction of national guidelines, new diagnostic procedures and treatments. This study aimed to investigate temporal trends in lung cancer survival both overall and in subgroups defined by established prognostic factors (i.e., sex, stage, histopathology and smoking history). We estimated one-, two-, and five-year relative survival, and excess mortality, in patients diagnosed with squamous cell carcinoma or adenocarcinoma of the lung between 1995 and 2016 in Sweden. We used population-based information available in a national lung cancer research database (LCBaSe) generated by cross-linkage between the Swedish National Lung Cancer Register and several Swedish health and sociodemographic registers. We included 36,935 patients diagnosed with squamous cell carcinoma or adenocarcinoma of the lung between 1995 and 2016. The overall one-, two- and five-year survival estimates increased between 1995 and 2016, from 38% to 53%, 21% to 37%, and 14% to 24%, respectively. Over the study period, we also found improved survival in subgroups, for example in patients with stages III-IV disease, patients with adenocarcinoma, and never-smokers. The excess mortality decreased over the study period, both overall and in all subgroups. Lung cancer survival increased over time in the overall lung cancer population. Of special note was evidence of improved survival in patients with stage IV disease. Our results corroborate a previously observed global trend of improved survival in patients with lung cancer.
Population-based cancer incidence rates were abstracted from the National Cancer Institute's State Cancer Profiles for all counties in the United States for which data were available. This is a national county-level database of cancer data that are collected by state public health surveillance systems. All-site cancer is defined as any type of cancer that is captured in the state registry data, though non-melanoma skin cancer is not included. All-site age-adjusted cancer incidence rates were abstracted separately for males and females. County-level annual age-adjusted all-site cancer incidence rates for years 2006–2010 were available for 2687 of 3142 (85.5%) counties in the U.S. Counties for which there are fewer than 16 reported cases in a specific area-sex-race category are suppressed to ensure confidentiality and stability of rate estimates; this accounted for 14 counties in our study. Two states, Kansas and Virginia, do not provide data because of state legislation and regulations which prohibit the release of county-level data to outside entities. Data from Michigan do not include cases diagnosed in other states because data exchange agreements prohibit the release of data to third parties. Finally, state data are not available for three states: Minnesota, Ohio, and Washington. The age-adjusted average annual incidence rate for all counties was 453.7 per 100,000 persons. We selected 2006–2010 as it is subsequent in time to the EQI exposure data, which were constructed to represent the years 2000–2005. We also gathered data for the three leading cancers for males (lung, prostate, and colorectal) and females (lung, breast, and colorectal). The EQI was used as the exposure metric, an indicator of cumulative environmental exposures at the county level for the period 2000 to 2005. A complete description of the datasets used in the EQI is provided in Lobdell et al., and the methods used for index construction are described by Messer et al.
The EQI was developed for the period 2000–2005 because it was the time period for which the most recent data were available when index construction was initiated. The EQI includes variables representing each of the environmental domains. The air domain includes 87 variables representing criteria and hazardous air pollutants. The water domain includes 80 variables representing overall water quality, general water contamination, recreational water quality, drinking water quality, atmospheric deposition, drought, and chemical contamination. The land domain includes 26 variables representing agriculture, pesticides, contaminants, facilities, and radon. The built domain includes 14 variables representing roads, highway/road safety, public transit behavior, business environment, and subsidized housing environment. The sociodemographic environment includes 12 variables representing socioeconomics and crime. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Human health data are not available publicly. EQI data are available at: https://edg.epa.gov/data/Public/ORD/NHEERL/EQI. Format: Data are stored as csv files. This dataset is associated with the following publication: Jagai, J., L. Messer, K. Rappazzo, C. Gray, S. Grabich, and D. Lobdell. County-level environmental quality and associations with cancer incidence. Cancer. John Wiley & Sons Incorporated, New York, NY, USA, 123(15): 2901-2908, (2017).
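Two details from this description lend themselves to a short sketch: the suppression rule for categories with fewer than 16 reported cases, and the arithmetic behind the coverage figure and the EQI variable counts. The function and variable names are illustrative, the county names and counts in the example are hypothetical, and the total of 219 variables is simply the sum of the five domain counts listed in the text.

```python
from typing import Dict, Optional

SUPPRESSION_THRESHOLD = 16  # categories with fewer reported cases are suppressed


def suppress_small_counts(
    counts: Dict[str, int], threshold: int = SUPPRESSION_THRESHOLD
) -> Dict[str, Optional[int]]:
    """Replace counts below the threshold with None (suppressed)."""
    return {area: (n if n >= threshold else None) for area, n in counts.items()}


# Hypothetical county case counts, for illustration only.
masked = suppress_small_counts({"County A": 120, "County B": 9, "County C": 16})
print(masked)

# Coverage figure quoted in the text: 2687 of 3142 counties.
coverage = 2687 / 3142
print(f"County coverage: {coverage:.1%}")

# Variables per EQI domain, as listed in the text.
eqi_domain_variables = {
    "air": 87,
    "water": 80,
    "land": 26,
    "built": 14,
    "sociodemographic": 12,
}
total_eqi_variables = sum(eqi_domain_variables.values())
print(f"Total EQI variables across domains: {total_eqi_variables}")
```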
Death rate has been age-adjusted by the 2000 U.S. standard population. Single-year data are only available for Los Angeles County overall, Service Planning Areas, Supervisorial Districts, City of Los Angeles overall, and City of Los Angeles Council Districts.
Lung cancer is a leading cause of cancer-related death in the US. People who smoke have the greatest risk of lung cancer, though lung cancer can also occur in people who have never smoked. Most cases are due to long-term tobacco smoking or exposure to secondhand tobacco smoke. Cities and communities can take an active role in curbing tobacco use and reducing lung cancer by adopting policies to regulate tobacco retail; reducing exposure to secondhand smoke in outdoor public spaces, such as parks, restaurants, or in multi-unit housing; and improving access to tobacco cessation programs and other preventive services.
For more information about the Community Health Profiles Data Initiative, please see the initiative homepage.
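Age adjustment of the kind mentioned above is commonly done by direct standardization: each age group's rate is weighted by that group's share of a standard population. The following is a minimal sketch of that technique. The three age groups, their counts, and the weights are hypothetical and are not the actual 2000 U.S. standard population weights, which use finer age bands.

```python
from typing import Dict


def age_adjusted_rate(
    deaths: Dict[str, int],
    population: Dict[str, int],
    standard_weights: Dict[str, float],
) -> float:
    """Directly standardized death rate per 100,000 population."""
    total_weight = sum(standard_weights.values())
    adjusted = 0.0
    for group, weight in standard_weights.items():
        age_specific_rate = deaths[group] * 100_000 / population[group]
        adjusted += age_specific_rate * (weight / total_weight)
    return adjusted


# Hypothetical inputs for illustration only.
deaths = {"<45": 2, "45-64": 30, "65+": 90}
population = {"<45": 100_000, "45-64": 50_000, "65+": 30_000}
weights = {"<45": 0.60, "45-64": 0.25, "65+": 0.15}  # illustrative standard population shares

adjusted = age_adjusted_rate(deaths, population, weights)
crude = sum(deaths.values()) * 100_000 / sum(population.values())
print(f"Crude rate: {crude:.1f} per 100,000")
print(f"Age-adjusted rate: {adjusted:.1f} per 100,000")
```

In this made-up example the crude rate exceeds the adjusted rate because the hypothetical population is older than the illustrative standard, which is exactly the distortion age adjustment is meant to remove when comparing areas.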
This map service portrays the number of deaths per 100,000 people per square mile from lung and colon cancer. It displays the distribution of lung and colon cancer across the United States. Pop-ups show attributes such as state name, county name, number of colon or lung cancer deaths, and square miles per area.
Lung Cancer: Death due to malignant neoplasm of the trachea, bronchus and lung.
Colon Cancer: Death due to malignant neoplasm of the colon, rectum and anus.
This data was sourced from: Community Health Status Indicators
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Lung Image Database Consortium image collection (LIDC-IDRI) consists of diagnostic and lung cancer screening thoracic computed tomography (CT) scans with marked-up annotated lesions. It is a web-accessible international resource for development, training, and evaluation of computer-assisted diagnostic (CAD) methods for lung cancer detection and diagnosis. Initiated by the National Cancer Institute (NCI), further advanced by the Foundation for the National Institutes of Health (FNIH), and accompanied by the Food and Drug Administration (FDA) through active participation, this public-private partnership demonstrates the success of a consortium founded on a consensus-based process.
Seven academic centers and eight medical imaging companies collaborated to create this data set which contains 1018 cases. Each subject includes images from a clinical thoracic CT scan and an associated XML file that records the results of a two-phase image annotation process performed by four experienced thoracic radiologists. In the initial blinded-read phase, each radiologist independently reviewed each CT scan and marked lesions belonging to one of three categories ("nodule > or =3 mm," "nodule <3 mm," and "non-nodule > or =3 mm"). In the subsequent unblinded-read phase, each radiologist independently reviewed their own marks along with the anonymized marks of the three other radiologists to render a final opinion. The goal of this process was to identify as completely as possible all lung nodules in each CT scan without requiring forced consensus.
Note: The TCIA team strongly encourages users to review pylidc and the standardized DICOM representation of the TCIA LIDC-IDRI annotations (DICOM-LIDC-IDRI-Nodules), which covers the annotations/segmentations included in this dataset, before developing custom tools to analyze the XML version.
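To make the three annotation categories from the blinded-read phase concrete, the sketch below maps a lesion to its category label. It only illustrates the category logic described above; it is not part of pylidc or of the XML schema, the function name is hypothetical, and returning None for a non-nodule under 3 mm reflects the fact that the text lists no category for such findings.

```python
from typing import Optional


def lidc_category(is_nodule: bool, diameter_mm: float) -> Optional[str]:
    """Return the LIDC-IDRI marking category for a lesion, or None if it has no category."""
    if diameter_mm < 0:
        raise ValueError("diameter_mm must be non-negative")
    if is_nodule:
        return "nodule >=3 mm" if diameter_mm >= 3 else "nodule <3 mm"
    # Only non-nodules of at least 3 mm have a category in the description above.
    return "non-nodule >=3 mm" if diameter_mm >= 3 else None


# Illustrative lesions, not taken from the collection.
for is_nodule, size in [(True, 7.5), (True, 2.0), (False, 4.2), (False, 1.0)]:
    print(is_nodule, size, "->", lidc_category(is_nodule, size))
```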
The National Lung Screening Trial (NLST) was a randomized controlled trial conducted by the Lung Screening Study group (LSS) and the American College of Radiology Imaging Network (ACRIN) to determine whether screening for lung cancer with low-dose helical computed tomography (CT) reduces mortality from lung cancer in high-risk individuals relative to screening with chest radiography. Approximately 54,000 participants were enrolled between August 2002 and April 2004. Data collection has ended, and information is complete through December 31, 2009. NLST has the ClinicalTrials.gov registration number NCT00047385.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Among many types of cancers, to date, lung cancer remains one of the deadliest cancers around the world. Many researchers, scientists, doctors, and people from other fields continuously contribute to this subject regarding early prediction and diagnosis. One of the significant problems in prediction is the black-box nature of machine learning models. Though the detection rate is comparatively satisfactory, people have yet to learn how a model came to that decision, causing trust issues among patients and healthcare workers. This work uses multiple machine learning models on a numerical dataset of lung cancer-relevant parameters and compares performance and accuracy. After comparison, each model has been explained using different methods. The main contribution of this research is to give logical explanations of why the model reached a particular decision to achieve trust. This research has also been compared with a previous study that worked with a similar dataset and took expert opinions regarding their proposed model. We also showed that our research achieved better results than their proposed model and specialist opinion using hyperparameter tuning, having an improved accuracy of almost 100% in all four models.
http://reference.data.gov.uk/id/open-government-licence
National Cancer Registration And Analysis Service (NCRAS). (2019). Cancer Registration: Frequency of lung Cancer tumours Diagnosed in 2015-2016 by CCG and Route to Diagnosis (2015 -2016) [Dataset]. Public Health England. https://doi.org/10.25503/7gpv-d753
Aggregated data on lung cancer tumours (ICD-10 C33-C34) diagnosed in 2015-2016 in the English resident population.
Data within the file:
- PATIENTS (Count of tumours)
- CCG_NAME (Name of resident CCG)
- CCG_ROUTE (Name of Route to Diagnosis)
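As a rough illustration of how the three columns listed above could be used, the following Python sketch sums tumour counts per Route to Diagnosis and converts them to shares. Only the column names PATIENTS, CCG_NAME, and CCG_ROUTE come from the description; the sample rows, CCG names, route labels, and function names are hypothetical, and the sketch assumes every PATIENTS value is a plain integer (the real file may contain suppressed or non-numeric cells that need extra handling).

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows: CCG names, route labels, and counts are invented
# for illustration and are NOT taken from the NCRAS file.
SAMPLE_CSV = """PATIENTS,CCG_NAME,CCG_ROUTE
120,Example CCG A,Emergency presentation
80,Example CCG A,GP referral
95,Example CCG B,Emergency presentation
60,Example CCG B,Two Week Wait
"""


def tumours_by_route(csv_text):
    """Sum the PATIENTS tumour counts for each CCG_ROUTE value."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumes PATIENTS is always an integer count.
        totals[row["CCG_ROUTE"]] += int(row["PATIENTS"])
    return dict(totals)


def route_shares(totals):
    """Convert per-route totals into fractions of all counted tumours."""
    grand_total = sum(totals.values())
    return {route: count / grand_total for route, count in totals.items()}


totals = tumours_by_route(SAMPLE_CSV)
shares = route_shares(totals)
for route, count in sorted(totals.items()):
    print(f"{route}: {count} tumours ({shares[route]:.1%})")
```

Grouping on CCG_NAME instead of CCG_ROUTE would give totals per CCG with the same code.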
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This synthetic Lung Cancer Risk Prediction Dataset is designed for educational and research purposes in the fields of data science, public health, and cancer research. It contains essential health and lifestyle indicators such as smoking habits, chronic diseases, and respiratory symptoms, which can be used to analyze and predict the risk of lung cancer. The dataset is ideal for building predictive models, conducting risk assessments, and exploring the relationships between lifestyle factors and lung health.
This dataset is ideal for various lung cancer-related applications:
CC0 (Public Domain)
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This table contains 30810 series, with data for years 2001/2003 - 2013/2015 (not all combinations necessarily have data for all years). This table contains data described by the following dimensions (Not all combinations are available): Geography (158 items: Canada; Newfoundland and Labrador; Eastern Regional Health Authority, Newfoundland and Labrador; Central Regional Health Authority, Newfoundland and Labrador; ...); Sex (3 items: Both sexes; Males; Females); Selected sites of cancer (ICD-O-3) (5 items: All invasive primary cancer sites (including in situ bladder); Colon, rectum and rectosigmoid junction cancer; Bronchus and lung cancer; Female breast cancer; ...); Characteristics (13 items: Number of new cancer cases; Cancer incidence (rate per 100,000 population); Low 95% confidence interval, cancer incidence (rate per 100,000 population); High 95% confidence interval, cancer incidence (rate per 100,000 population); ...).
http://reference.data.gov.uk/id/open-government-licence
Years of Life Lost (YLL) as a result of death from lung cancer - Directly Age-Standardised Rates (DSR) per 100,000 population
Source: Office for National Statistics (ONS)
Publisher: Information Centre (IC) - Clinical and Health Outcomes Knowledge Base
Geographies: Local Authority District (LAD), Government Office Region (GOR), National, Primary Care Trust (PCT), Strategic Health Authority (SHA)
Geographic coverage: England
Time coverage: 2005-07, 2007
Type of data: Administrative data
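The entry above reports directly age-standardised rates per 100,000 population but does not state how they were computed. As a hedged sketch of the general direct standardisation method (not necessarily the exact ONS procedure, which may use specific age bands, age cut-offs, and a particular standard population), each age group's crude rate is weighted by that group's share of a standard population. All numbers and the function name below are hypothetical.

```python
def directly_standardised_rate(events, populations, standard_weights, per=100_000):
    """Direct age standardisation.

    events[i]           -- events (e.g. years of life lost) in age group i
    populations[i]      -- local population in age group i
    standard_weights[i] -- standard population size (or weight) for age group i
    Returns the weighted average of age-specific rates, scaled to `per`.
    """
    if not (len(events) == len(populations) == len(standard_weights)):
        raise ValueError("all inputs must have one entry per age group")
    weighted_sum = sum(
        w * e / n for e, n, w in zip(events, populations, standard_weights)
    )
    return per * weighted_sum / sum(standard_weights)


# Hypothetical two-age-group example (invented numbers, not ONS data).
yll = [10, 40]             # years of life lost per age group
population = [1000, 2000]  # local population per age group
weights = [3, 1]           # standard population weights

dsr = directly_standardised_rate(yll, population, weights)
print(f"DSR: {dsr:.1f} per 100,000")
```

Because the weights come from a fixed standard population, rates computed this way can be compared across areas with different age structures.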
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We analyzed NSCLC data from the Surveillance, Epidemiology, and End Results database, focusing on lung adenocarcinoma (LUAD) and lung squamous carcinoma (LUSC).
This study will explore the correlation of biomarkers with response rate, as well as the overall efficacy and safety, of Avastin in combination with carboplatin-based chemotherapy in patients with advanced or recurrent non-squamous non-small cell lung cancer. Patients will be randomized to one of 2 groups, to receive either Avastin 7.5 mg/kg IV or Avastin 15 mg/kg IV on day 1 of each 3-week cycle; all patients will also receive treatment with carboplatin and either gemcitabine or paclitaxel for a maximum of 6 cycles. The anticipated time on study treatment is until disease progression, and the target sample size is 100-500 individuals.