86 datasets found

Lung-Cancer-Risk-Dataset

kaggle.com

Updated Aug 23, 2025

Cite

Mikey-TraceGod (2025). Lung-Cancer-Risk-Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/12844025

Explore at:

Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.34740/kaggle/dsv/12844025

Dataset updated

Aug 23, 2025

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Mikey-TraceGod

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Lung Cancer Risk Dataset

Overview

This dataset contains 50,000 patient profiles designed for lung cancer risk analysis and machine learning applications. The dataset is clean, preprocessed, and ready for immediate use in classification tasks, statistical analysis, and data visualization.

Rows: 50,000
Columns: 11
File: preprocessed_lung_cancer_dataset.csv
License: CC0: Public Domain

Dataset Description

The dataset includes patient profiles with features based on established lung cancer risk factors such as smoking history, environmental exposures, and chronic lung conditions. All data is synthetic and designed to reflect realistic risk factor distributions while maintaining patient privacy.

Features

Column	Type	Description	Values/Range
patient_id	Integer	Unique patient identifier	100000-149999
age	Integer	Patient age in years	18-100
gender	String	Patient gender	'Male', 'Female'
pack_years	Float	Smoking exposure (years × packs per day)	0-100
radon_exposure	String	Residential radon exposure level	'Low', 'Medium', 'High'
asbestos_exposure	String	Occupational asbestos exposure history	'Yes', 'No'
secondhand_smoke_exposure	String	Passive smoking exposure	'Yes', 'No'
copd_diagnosis	String	Chronic obstructive pulmonary disease diagnosis	'Yes', 'No'
alcohol_consumption	String	Alcohol consumption pattern	'None', 'Moderate', 'Heavy'
family_history	String	Family history of lung cancer	'Yes', 'No'
lung_cancer	String	Target variable: Lung cancer diagnosis	'Yes', 'No'

Data Quality

Complete: No missing values or duplicates
Clean: All values within realistic ranges
Balanced Features: Realistic distribution of risk factors
Target Distribution: Approximately 25% positive cases, reflecting real-world lung cancer prevalence
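
The schema and quality claims above can be checked with a short validation pass. The following is a minimal sketch, assuming the CSV has a header row whose names match the Features table; the helper names `validate_row` and `positive_rate` and the three in-memory sample rows are hypothetical and are not taken from the real file.

```python
import csv
import io

# Allowed values per categorical column, copied from the Features table.
CATEGORICAL = {
    "gender": {"Male", "Female"},
    "radon_exposure": {"Low", "Medium", "High"},
    "asbestos_exposure": {"Yes", "No"},
    "secondhand_smoke_exposure": {"Yes", "No"},
    "copd_diagnosis": {"Yes", "No"},
    "alcohol_consumption": {"None", "Moderate", "Heavy"},
    "family_history": {"Yes", "No"},
    "lung_cancer": {"Yes", "No"},
}

# Inclusive numeric ranges, also from the Features table.
NUMERIC = {
    "patient_id": (100000, 149999),
    "age": (18, 100),
    "pack_years": (0, 100),
}


def validate_row(row):
    """Return a list of problems found in one CSV row (empty if clean)."""
    problems = []
    for col, allowed in CATEGORICAL.items():
        if row.get(col) not in allowed:
            problems.append(f"{col}: unexpected value {row.get(col)!r}")
    for col, (low, high) in NUMERIC.items():
        try:
            value = float(row[col])
        except (KeyError, TypeError, ValueError):
            problems.append(f"{col}: missing or non-numeric")
            continue
        if not low <= value <= high:
            problems.append(f"{col}: {value} outside [{low}, {high}]")
    return problems


def positive_rate(rows):
    """Share of rows whose target column lung_cancer is 'Yes'."""
    rows = list(rows)
    return sum(r["lung_cancer"] == "Yes" for r in rows) / len(rows)


# Hypothetical sample rows (illustration only, not real dataset records).
SAMPLE = (
    "patient_id,age,gender,pack_years,radon_exposure,asbestos_exposure,"
    "secondhand_smoke_exposure,copd_diagnosis,alcohol_consumption,"
    "family_history,lung_cancer\n"
    "100000,64,Male,32.5,High,No,Yes,Yes,Moderate,No,Yes\n"
    "100001,45,Female,0,Low,No,No,No,None,No,No\n"
    "100002,17,Female,120,Low,Maybe,No,No,Heavy,Yes,No\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

for row in rows:
    print(row["patient_id"], validate_row(row))
print("positive rate:", positive_rate(rows))
```

On the real file, the same functions could be applied to `csv.DictReader(open("preprocessed_lung_cancer_dataset.csv"))`. Some CSV readers treat the literal string `None` as a missing value, so it is worth checking how `alcohol_consumption` is parsed before trusting a "no missing values" check.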

Use Cases

Binary classification modeling
Risk factor correlation analysis
Data visualization and exploratory analysis
Machine learning pipeline development
Statistical hypothesis testing

The associations of sitting time and physical activity on total and...
plos.figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated Jun 1, 2023
Cite
Vegar Rangul; Erik R. Sund; Paul Jarle Mork; Oluf Dimitri Røe; Adrian Bauman (2023). The associations of sitting time and physical activity on total and site-specific cancer incidence: Results from the HUNT study, Norway [Dataset]. http://doi.org/10.1371/journal.pone.0206015
Explore at:
pdf (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0206015
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Vegar Rangul; Erik R. Sund; Paul Jarle Mork; Oluf Dimitri Røe; Adrian Bauman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Norway
Description
Background
Sedentary behavior is thought to pose different risks to those attributable to physical inactivity. However, few studies have examined the association between physical activity and sitting time with cancer incidence within the same population.
Methods
We followed 38,154 healthy Norwegian adults in the Nord-Trøndelag Health Study (HUNT) for cancer incidence from 1995–97 to 2014. Cox proportional hazards regression was used to estimate risk of site-specific and total cancer incidence by baseline sitting time and physical activity.
Results
During the 16-year follow-up, 4,196 (11%) persons were diagnosed with cancer. We found no evidence that people who had prolonged sitting per day or had low levels of physical activity had an increased risk of total cancer incidence, compared to those who had low sitting time and were physically active. In the multivariate model, sitting ≥8 h/day was associated with 22% (95% CI, 1.05–1.42) higher risk of prostate cancer compared to sitting 16.6 MET-h/week). The joint effects of physical activity and sitting time indicated that prolonged sitting time increased the risk of CRC independent of physical activity in men.
Conclusions
Our findings suggest that prolonged sitting and low physical activity are positively associated with colorectal, prostate, and lung cancer among men. Sitting time and physical activity were not associated with cancer incidence among women. The findings emphasize the importance of reducing sitting time and increasing physical activity.
Incidence of lung cancer in Europe in 2022, by country and gender
statista.com
Updated Sep 16, 2025
Cite
Statista (2025). Incidence of lung cancer in Europe in 2022, by country and gender [Dataset]. https://www.statista.com/statistics/1418818/incidence-of-lung-cancer-in-europe/
Explore at:
Dataset updated
Sep 16, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2022
Area covered
Europe, EU
Description
In 2022, the incidence of lung cancer among men in Europe was highest in Hungary at ***** per 100,000, while Sweden had the lowest incidence. The incidence of lung cancer recorded among women in Denmark was over ** per 100,000 population. Across the European Union overall, the rate of lung cancer diagnoses was **** per 100,000 among men and **** per 100,000 among women.
Smoking and lung cancer risk
The connection between smoking and the increased risk of health problems is well established. As of 2021, Hungary had one of the highest daily smoking rates in Europe, with over a quarter of adults in the Central European country smoking daily. The only countries with a higher share of smoking adults were Bulgaria and Turkey. A positive development, though, is that the share of adults smoking every day has decreased in almost every European country since 2011.
The rise of vaping
Originally marketed as a device to help smokers quit, e-cigarettes or vapes have seen increased popularity among people who never smoked cigarettes, especially young people. The use of vapes among young people was reported to be highest in Estonia, Czechia, and Ireland. The dangers of vaping have not been examined over the long term. In the EU there have been attempts to make vapes less accessible and appealing to young people, including banning flavors and stopping the sale of disposable e-cigarettes.
Data from: Identification of cancer chemotherapy regimens and patient...
datasetcatalog.nlm.nih.gov
tandf.figshare.com
Updated Mar 8, 2023
Cite
Mendelsohn, Aaron B.; Lockhart, Catherine M.; McDermott, Cara L.; DeFor, Terese A.; Pawloski, Pamala A.; Jamal-Allial, Aziza; Benitez, Gabriela Vazquez; Marshall, James; Yee, Gary; Djibo, Djeneba Audrey; Li, Minghui Sam; McBride, Ali (2023). Identification of cancer chemotherapy regimens and patient cohorts in administrative claims: challenges, opportunities, and a proposed algorithm [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000990575
Explore at:
Dataset updated
Mar 8, 2023
Authors
Mendelsohn, Aaron B.; Lockhart, Catherine M.; McDermott, Cara L.; DeFor, Terese A.; Pawloski, Pamala A.; Jamal-Allial, Aziza; Benitez, Gabriela Vazquez; Marshall, James; Yee, Gary; Djibo, Djeneba Audrey; Li, Minghui Sam; McBride, Ali
Description
Real-world evidence is a valuable source of information in healthcare. This study describes the challenges and successes during algorithm development to identify cancer cohorts and multi-agent chemotherapy regimens from claims data to perform a comparative effectiveness analysis of granulocyte colony stimulating factor (G-CSF) use. Using the Biologics and Biosimilars Collective Intelligence Consortium’s Distributed Research Network, we iteratively developed and tested a de novo algorithm to accurately identify patients by cancer diagnosis, then extract chemotherapy and G-CSF administrations for a retrospective study of prophylactic G-CSF. After identifying patients with cancer and subsequent chemotherapy exposures, we observed that only 12% of patients with cancer received chemotherapy, which is fewer than expected based on prior analyses. Therefore, we reversed the initial inclusion criteria to identify chemotherapy receipt first, then prior cancer diagnosis, which increased the number of patients from 2,814 to 3,645; 68% of patients receiving chemotherapy had diagnoses of interest. Additionally, we excluded patients with cancer diagnoses that differed from those of interest in the 183 days before the index date of G-CSF receipt, including early-stage cancers without G-CSF or chemotherapy exposure. By removing this criterion, we retained 77 patients who were previously excluded. Finally, we incorporated a 5-day window to identify all chemotherapy drugs administered (excluding oral prednisone and methotrexate, as these medications may be used for other non-malignant conditions) as patients may fill oral prescriptions days to weeks prior to infusion. This increased the number of patients with chemotherapy exposures of interest to 6,010. The final cohort of included patients, based on G-CSF exposure, increased from 420 with the initial algorithm to 886 with the final algorithm.
Medications used for multiple indications, sensitivity and specificity of administrative codes, and relative timing of medication exposure must all be evaluated to identify patient cohorts receiving chemotherapy from claims data.
Cancer Mortality in People Treated with Antidepressants before Cancer...
plos.figshare.com
ai
Updated Jun 1, 2023
Cite
Yuelian Sun; Peter Vedsted; Morten Fenger-Grøn; Chun Sen Wu; Bodil Hammer Bech; Jørn Olsen; Michael Eriksen Benros; Mogens Vestergaard (2023). Cancer Mortality in People Treated with Antidepressants before Cancer Diagnosis: A Population Based Cohort Study [Dataset]. http://doi.org/10.1371/journal.pone.0138134
Explore at:
ai (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0138134
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yuelian Sun; Peter Vedsted; Morten Fenger-Grøn; Chun Sen Wu; Bodil Hammer Bech; Jørn Olsen; Michael Eriksen Benros; Mogens Vestergaard
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background
Depression is common after a cancer diagnosis and is associated with an increased mortality, but it is unclear whether depression occurring before the cancer diagnosis affects cancer mortality. We aimed to study cancer mortality of people treated with antidepressants before cancer diagnosis.
Methods and Findings
We conducted a population based cohort study of all adults diagnosed with cancer between January 2003 and December 2010 in Denmark (N = 201,662). We obtained information on cancer from the Danish Cancer Registry, on the day of death from the Danish Civil Registry, and on redeemed antidepressants from the Danish National Prescription Registry. Current users of antidepressants were defined as those who redeemed the latest prescription of antidepressant 0–4 months before cancer diagnosis (irrespective of earlier prescriptions), and former users as those who redeemed the latest prescription five or more months before cancer diagnosis. We estimated an all-cause one-year mortality rate ratio (MRR) and a conditional five-year MRR for patients who survived the first year after cancer diagnosis and confidence interval (CI) using a Cox proportional hazards regression model. Overall, 33,111 (16.4%) patients redeemed at least one antidepressant prescription in the three years before cancer diagnosis of whom 21,851 (10.8%) were current users at the time of cancer diagnosis. Current antidepressant users had a 32% higher one-year mortality (MRR = 1.32, 95% CI: 1.29–1.35) and a 22% higher conditional five-year mortality (MRR = 1.22, 95% CI: 1.17–1.26) if patients survived the first year after the cancer diagnosis than patients not redeeming antidepressants. The one-year mortality was particularly high for patients who initiated antidepressant treatment within four months before cancer diagnosis (MRR = 1.54, 95% CI: 1.47–1.61). Former users had no increased cancer mortality.
Conclusions
Initiation of antidepressive treatment prior to cancer diagnosis is common and is associated with an increased mortality.
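The exposure definitions in the abstract above (current user: latest prescription 0–4 months before diagnosis; former user: five or more months before) can be written down as a small classifier. This is a sketch only: the function name, the use of whole months as input, and the "non-user" label for patients with no redeemed prescription are assumptions made for illustration, not the study's actual code.

```python
def classify_antidepressant_use(months_since_last_prescription):
    """Classify a patient by the timing of the latest redeemed prescription.

    months_since_last_prescription: whole months between the latest
    redeemed antidepressant prescription and the cancer diagnosis,
    or None if no prescription was redeemed (assumed encoding).
    """
    if months_since_last_prescription is None:
        return "non-user"
    if months_since_last_prescription < 0:
        raise ValueError("prescription must precede the diagnosis")
    if months_since_last_prescription <= 4:
        return "current"  # 0-4 months before diagnosis
    return "former"       # five or more months before diagnosis


# The reported shares follow directly from the counts in the abstract.
N_PATIENTS = 201_662
share_any_use = 33_111 / N_PATIENTS   # about 16.4%
share_current = 21_851 / N_PATIENTS   # about 10.8%

print(classify_antidepressant_use(3))
print(classify_antidepressant_use(12))
print(f"{share_any_use:.1%} any use, {share_current:.1%} current users")
```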
Data from: Data belonging to 'Smoking intensity and bladder cancer...
lifesciences.datastations.nl
tsv, zip
Updated Feb 12, 2018
Cite
L.A.L.M. Kiemeney; L.A.L.M. Kiemeney (2018). Data belonging to 'Smoking intensity and bladder cancer aggressiveness at diagnosis' [Dataset]. http://doi.org/10.17026/DANS-2A6-ATE2
Explore at:
zip (22047), tsv (80480), tsv (82205) (available download formats)
Unique identifier
https://doi.org/10.17026/DANS-2A6-ATE2
Dataset updated
Feb 12, 2018
Dataset provided by
DANS Data Station Life Sciences
Authors
L.A.L.M. Kiemeney; L.A.L.M. Kiemeney
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data set is part of the Nijmegen Bladder Cancer Study, one of the largest series of bladder cancer in the world (see https://icbc.cancer.gov/). The data were used to investigate the relationship between smoking and bladder cancer aggressiveness at diagnosis. The results will be published as Barbosa A.L.A. et al., Smoking intensity and bladder cancer aggressiveness at diagnosis. Plos One (submitted).
The Nijmegen Bladder Cancer Study (NBCS) has been described in more detail in (http://www.ncbi.nlm.nih.gov/pubmed/25023787). Briefly, BC patients diagnosed between 1995-2011 under the age of 75 years in the mid-eastern part of the Netherlands were identified through the Netherlands Cancer Registry (NCR) held by the Netherlands Comprehensive Cancer Organization (IKNL) and contacted via their treating physicians. Patients who consented to participate in the study were asked to fill out a lifestyle questionnaire, including questions on education, occupation, medical history, physical activity, and complete history of smoking. Furthermore, blood samples were collected by Thrombosis Service centers, which hold offices in all the communities in the region. The study was approved by the institutional review board of the Radboud university medical center, Nijmegen, The Netherlands (CMO Arnhem-Nijmegen). A total of 1859 BC patients were included in the study.
Smoking assessment
Information on smoking history was obtained via the lifestyle questionnaire. Patients were asked for their smoking status at recruitment, age at smoking initiation and cessation, number of cigarettes, pipes and cigars smoked per day and duration of smoking in years. The timing of smoking cessation with respect to the diagnosis was calculated as age at diagnosis minus age at cessation. Smoking status at diagnosis was classified as never smoker, former smoker (quit >1 year before diagnosis), or current smoker (continuing cigarette smoker or quit ≤ 1 year before diagnosis). Ever smokers were defined as the combination of former and current smokers. In the current smokers group, only the smoking period in years before the diagnosis was considered. Smoking amount was evaluated as cigarettes per day. Cumulative smoking exposure (in pack-years) was calculated by multiplying the cigarette smoking duration and packages per day (20 cigarettes representing one package). Pipe and/or cigar smoking (5.9% of all patients) was ignored in the main analyses, assuming that the majority of Dutch pipe and cigar smokers do not inhale the smoke.
Outcome assessment
Detailed clinical data concerning age at diagnosis, tumor stage, tumor grade, tumor number (single or multiple), tumor size (<3cm and ≥ 3cm), presence of concomitant CIS, and histological type were collected through a medical file survey. Tumor stage and grade were recorded according to the final conclusion in the pathology report. Tumors with WHO 1973 differentiation grade 1 or 2, WHO/ISUP 2004 low grade, or Malmström (Modified Bergkvist) grade 1 or 2a were considered low-grade tumors. We classified tumors with WHO 1973 differentiation grade 3, WHO/ISUP 2004 high grade, or Malmström (Modified Bergkvist) grade 2b or 3 as high-grade. Tumor aggressiveness was classified according to the risk of progression as follows: low-risk NMIBC (low-grade Ta tumors), high-risk NMIBC (all stage T1 tumors, all high-grade tumors, or CIS) and MIBC (stage ≥ T2 or any stage with ≥N1 and/or M1).
Statistical analysis
Patient and tumor characteristics were compared between the smoking status categories using chi-square, Fisher exact, and one-way analysis of variance (ANOVA) tests where appropriate. The distribution of continuous smoking variables was compared between the categories of tumor multiplicity and tumor aggressiveness and tested for statistical significance using the non-parametric Kruskal-Wallis test. Multinomial logistic regression was used to analyze the relation between smoking intensity and aggressiveness of the tumor with adjustment for gender and age at diagnosis. Low-risk NMIBC was considered as the reference group. We repeated similar analyses for tumor multiplicity as the dependent variable using solitary tumors as the reference group. The association of each smoking intensity variable (smoking amount, smoking duration and cumulative smoking exposure), age at smoking initiation, and time since smoking cessation was assessed separately in ever, former and current smokers. Statistical analysis was performed using IBM SPSS Statistics for Windows 20 (IBM Corp., Armonk, NY, USA) with a p value < 0.05 indicating statistical significance.
This dataset contains the statistical datafile (SPSS) used for the data analyses, saved as a .sav and a .por.
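The smoking definitions in the description above reduce to two small calculations: pack-years (duration multiplied by packs per day, with 20 cigarettes per pack) and smoking status at diagnosis (former if the patient quit more than one year before diagnosis). A minimal sketch follows; the function and parameter names are hypothetical, and treating a missing cessation age as "still smoking" is an assumption made for illustration.

```python
CIGARETTES_PER_PACK = 20  # "20 cigarettes representing one package"


def pack_years(cigarettes_per_day, years_smoked):
    """Cumulative exposure: packs per day multiplied by years smoked."""
    return (cigarettes_per_day / CIGARETTES_PER_PACK) * years_smoked


def smoking_status(ever_smoked, age_at_diagnosis, age_at_cessation=None):
    """Classify smoking status at diagnosis as never, former, or current.

    age_at_cessation=None is assumed to mean the patient never quit.
    """
    if not ever_smoked:
        return "never"
    if age_at_cessation is None:
        return "current"
    # Timing of cessation: age at diagnosis minus age at cessation.
    years_since_quit = age_at_diagnosis - age_at_cessation
    # Quit more than 1 year before diagnosis -> former; otherwise current.
    return "former" if years_since_quit > 1 else "current"


print(pack_years(20, 30))           # one pack a day for 30 years
print(smoking_status(True, 60, 50))
print(smoking_status(True, 60, 59))
```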
Cancer Registration: National Cancer Patient Experience Survey Wave 1 by...
data.europa.eu
ckan.publishing.service.gov.uk
excel xlsx
Updated Oct 11, 2021
+ more versions
Cite
Public Health England (2021). Cancer Registration: National Cancer Patient Experience Survey Wave 1 by patient characteristics and route to diagnosis [Dataset]. https://data.europa.eu/data/datasets/ncpes-wave-1-by-patient-characteristics-and-route-to-diagnosis
Explore at:
excel xlsx (available download formats)
Dataset updated
Oct 11, 2021
Dataset authored and provided by
Public Health England (https://www.gov.uk/government/organisations/public-health-england)
License
http://reference.data.gov.uk/id/open-government-licence
Description
The English Cancer Patient Experience Survey (CPES) is commissioned by NHS England and administered on their behalf by an external survey provider organisation (Quality Health). The survey provides insights into the care experienced by cancer patients across England who were treated as day cases or inpatients. Data from CPES has been linked to cancer registration records recorded by the National Cancer Registration and Analysis Service (the cancer registry in England). Individual responses to Wave 1 of CPES are recorded, alongside characteristics of the patient who completed the survey.

Wave 1 of the National Cancer Patient Experience Survey is limited to patients discharged from cancer care between 01/01/2010 – 31/03/2010.

Data within the file:
--PATIENT_PSEUDO_ID (Project specific Pseudonymised Patient ID)
--GENDER (coded Male, Female)
--QUINTILE2010 (Deprivation quintile [1-5], describing the Income Deprivation Domain where 1= least deprived and 5= most deprived)
--FINAL_ROUTE (One of eight Routes to Diagnosis; methodology for the assignment of each route is described in Elliss-Brookes L, McPhail S, Greenslade M, Shelton J, Hiom S, Richards M (2012) Routes to diagnosis for cancer – determining the patient journey using multiple routine data sets. British Journal of Cancer 107: 1220–1226.)
--AGE (aggregated in 4 categories: <55, 55-64, 65-74, 75+)
--STAGE (stage of the cancer coded as I, II, III, IV, missing)
--CANCER_SITE (Cancer sites coded in accordance with ICD 10: C00-C14, C15, C16, C18, C19-C20, C25, C33-C34, C43, C49, C50, C54, C56, C61, C64, C67, C73, C82, C83, C85, C90, C91-C95, D05 and ‘all other ICD-10 codes’)

Specific disclosure controls applied:
--Gender omitted from the data specification in the following cancer sites: Female only for C50, D05 and C73; Male only for C49
--Self-reported ethnicity (from the CPES surveys) aggregated into white British / non-white British / not specified.
--Self-reported ethnicity omitted for C49, C64, C73 (replaced as “missing”).
Mortality rate from cancer in Russia 2023, by federal subject
statista.com
Updated Nov 29, 2025
Cite
Statista (2025). Mortality rate from cancer in Russia 2023, by federal subject [Dataset]. https://www.statista.com/statistics/1168769/death-rate-by-cancer-by-federal-subject-russia/
Explore at:
Dataset updated
Nov 29, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2023
Area covered
Russia
Description
In 2023, around *** deaths per 100,000 population in Russia were attributed to malignant neoplasms. The highest cancer mortality rate in the country was recorded in the Kurgan Oblast, at over *** deaths per 100,000 inhabitants. The Ingushetia Republic had the lowest mortality rate from cancer, at approximately ** deaths per 100,000 population.
Cancer mortality in Russia
Cancer is the second-leading cause of mortality in Russia, surpassed only by circulatory system diseases, which were responsible for *** deaths per 100 thousand population in 2022. However, the number of deaths from cancer has been steadily decreasing year-on-year. In 2021, approximately *** thousand Russians died of malignant tumors. That marked a four-percent decrease from the previous year. Furthermore, the five-year cancer survival rate reached an all-time maximum. As of 2021, nearly six in ten patients in Russia continued to be registered with an oncological establishment for five years or more after receiving their diagnosis.
Growth in cancer risk factors in Russia
Some well-known risk factors for cancer include sun exposure, tobacco and alcohol use, a poor diet, and being overweight. Despite the merits of a healthy lifestyle being widely recognized, the share of healthy lifestyle followers in Russia has been following a downward trend over the past years. In particular, the share of heavy smokers has increased. In 2022, a fifth of Russians consumed one pack of cigarettes a day or more, a three-percent increase from 2020.
Identifying Diseases Treatments in Healthcare Data
kaggle.com
zip
Updated Mar 5, 2025
Cite
Sagar Maru (2025). Identifying Diseases Treatments in Healthcare Data [Dataset]. https://www.kaggle.com/datasets/marusagar/identifying-diseases-treatments-in-healthcare-data
Explore at:
zip (166655 bytes) (available download formats)
Dataset updated
Mar 5, 2025
Authors
Sagar Maru
Description
Identifying Entities (Diseases, Treatments) in Healthcare Data

Finding diseases and treatments in medical text—because even AI needs a medical degree to understand doctor’s notes! 🩺🤖

📊 Understanding the Dataset

In the contemporary healthcare ecosystem, substantial amounts of unstructured textual data are generated daily through electronic health records (EHRs), doctors' notes, prescriptions, and medical literature. The ability to extract meaningful insights from this data is critical for improving patient care, advancing clinical research, and optimizing healthcare services. This dataset contains text-based medical data in which diseases and their corresponding treatments are embedded within unstructured sentences.

The dataset consists of labeled text samples, categorized into:
- **Train Sentences**: Sentences containing clinical information, including patient diagnoses and the treatments administered.
- **Train Labels**: The corresponding annotations for the train sentences, marking diseases and treatments as named entities.
- **Test Sentences**: Similar to the train sentences but used to evaluate model performance.
- **Test Labels**: The ground truth labels for the test sentences.

A sneak peek at the dataset may look as follows:

🔍 Example from Dataset:

Train Sentences:

_ "The patient was a 62 -year -old man with squamous epithelium, who was previously treated with success with a combination of radiation therapy and chemotherapy."

Train Labels:

Disease: 🦠 lung cancer

Treatment: 💉 Radiation therapy, chemotherapy

This dataset requires the use of **Named Entity Recognition (NER)** to extract diseases and map them to their related treatments 💊, turning unstructured medical data into a structured form for analytical purposes.

⚙️ Dataset Properties

Unstructured medical text: The dataset contains free-form medical notes in which diseases and treatments are mentioned. Extracting this information without an explicit mapping is a challenge.

Multiple entity types: The dataset contains different named entities such as diseases, treatments, symptoms, and possibly medications.

Contextual dependency: Many treatments apply to many diseases, and proper mapping depends on context. For example, "radiotherapy" is used for different cancers, which makes contextual understanding important.

Imbalanced data distribution: Some diseases and treatments may appear more often than others; balancing model performance requires techniques such as oversampling, undersampling, or transfer learning.

Domain-specific language: The text is rich in medical terminology, which requires special preprocessing using domain-specific NLP techniques and medical ontologies such as UMLS or SNOMED CT.

🚧 Challenges Working with Dataset

Complex medical vocabulary: Medical texts often use jargon, which requires specialized NLP models trained on clinical corpora.

Implicit Relationships: Unlike structured datasets, disease-treatment relationships are inferred from context rather than explicitly stated.

Synonyms and Abbreviations: Diseases and treatments can be referred to by different names (e.g., ‘myocardial infarction’ vs. ‘heart attack’). Handling such variations is vital.

Noise in Data: Unstructured data may also contain irrelevant information, typographical errors, and inconsistencies that affect extraction accuracy.

🛠️ Approach to Extracting Insights from the Dataset

To extract diseases and their respective treatments from this dataset, we follow a structured NLP pipeline:

1. Data Preprocessing 🧹

Text Cleaning: Remove unnecessary characters, numbers, and stopwords while preserving clinical terms.

Tokenization: Split sentences into words for better processing.

Medical Term Standardization: Use domain-specific libraries like SciSpacy to standardize synonyms and abbreviations.

2. Named Entity Recognition (NER) Model Development 🤖

Annotation: Ensure accurate labeling of diseases and treatments in the dataset.

Model Selection: Train a deep-learning-based model like BioBERT or a rule-based model using spaCy.

Training: Use annotated data to train a custom NER model that classifies words as disease or treatment entities.

Evaluation: Measure precision, recall, and F1-score to evaluate model performance.
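
The evaluation step above can be sketched at the entity level. This is a minimal illustration, assuming each entity is represented as a `(text, label)` pair and that only exact matches count as correct; the function name `entity_prf1` and the example predictions are hypothetical and do not come from the dataset author's notebook.

```python
def entity_prf1(gold, predicted):
    """Return (precision, recall, f1) for exact-match entity sets."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold labels follow the example sentence shown earlier in this description.
gold = {
    ("lung cancer", "Disease"),
    ("radiation therapy", "Treatment"),
    ("chemotherapy", "Treatment"),
}
# Hypothetical model output: one treatment missed, one spurious entity.
predicted = {
    ("lung cancer", "Disease"),
    ("chemotherapy", "Treatment"),
    ("surgery", "Treatment"),
}

precision, recall, f1 = entity_prf1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is the strictest choice; partial-overlap or token-level scoring would give different numbers for the same predictions.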

3. Mapping Diseases to Treatments 🔄

Contextual Relationship Extraction: Identify which treatment corresponds to which disease using dependency parsing and relation extraction.

Dictionary or Tabular Output: Store extracted mappings in a structured format.

Example Output:

| 🦠 Disease | 💉 Treatments | |----------|--------------------...
Long-term inpatient disease burden in the Adult Life after Childhood Cancer...
plos.figshare.com
docx
Updated Jun 1, 2023
Cite
Sofie de Fine Licht; Kathrine Rugbjerg; Thorgerdur Gudmundsdottir; Trine G. Bonnesen; Peter Haubjerg Asdahl; Anna Sällfors Holmqvist; Laura Madanat-Harjuoja; Laufey Tryggvadottir; Finn Wesenberg; Henrik Hasle; Jeanette F. Winther; Jørgen H. Olsen (2023). Long-term inpatient disease burden in the Adult Life after Childhood Cancer in Scandinavia (ALiCCS) study: A cohort study of 21,297 childhood cancer survivors [Dataset]. http://doi.org/10.1371/journal.pmed.1002296
Explore at:
docx (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pmed.1002296
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Sofie de Fine Licht; Kathrine Rugbjerg; Thorgerdur Gudmundsdottir; Trine G. Bonnesen; Peter Haubjerg Asdahl; Anna Sällfors Holmqvist; Laura Madanat-Harjuoja; Laufey Tryggvadottir; Finn Wesenberg; Henrik Hasle; Jeanette F. Winther; Jørgen H. Olsen
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background
Survivors of childhood cancer are at increased risk for a wide range of late effects. However, no large population-based studies have included the whole range of somatic diagnoses including subgroup diagnoses and all main types of childhood cancers. Therefore, we aimed to provide the most detailed overview of the long-term risk of hospitalisation in survivors of childhood cancer.
Methods and findings
From the national cancer registers of Denmark, Finland, Iceland, and Sweden, we identified 21,297 5-year survivors of childhood cancer diagnosed with cancer before the age of 20 years in the periods 1943–2008 in Denmark, 1971–2008 in Finland, 1955–2008 in Iceland, and 1958–2008 in Sweden. We randomly selected 152,231 population comparison individuals matched by age, sex, year, and country (or municipality in Sweden) from the national population registers. Using a cohort design, study participants were followed in the national hospital registers in Denmark, 1977–2010; Finland, 1975–2012; Iceland, 1999–2008; and Sweden, 1968–2009. Disease-specific hospitalisation rates in survivors and comparison individuals were used to calculate survivors’ standardised hospitalisation rate ratios (RRs), absolute excess risks (AERs), and standardised bed day ratios (SBDRs) based on length of stay in hospital. We adjusted for sex, age, and year by indirect standardisation. During 336,554 person-years of follow-up (mean: 16 years; range: 0–42 years), childhood cancer survivors experienced 21,325 first hospitalisations for diseases in one or more of 120 disease categories (cancer recurrence not included), when 10,999 were expected, yielding an overall RR of 1.94 (95% confidence interval [95% CI] 1.91–1.97). The AER was 3,068 (2,980–3,156) per 100,000 person-years, meaning that for each additional year of follow-up, an average of 3 of 100 survivors were hospitalised for a new excess disease beyond the background rates. Approximately 50% of the excess hospitalisations were for diseases of the nervous system (19.1% of all excess hospitalisations), endocrine system (11.1%), digestive organs (10.5%), and respiratory system (10.0%). Survivors of all types of childhood cancer were at increased, persistent risk for subsequent hospitalisation, the highest risks being those of survivors of neuroblastoma (RR: 2.6 [2.4–2.8]; n = 876), hepatic tumours (RR: 2.5 [2.0–3.1]; n = 92), central nervous system tumours (RR: 2.4 [2.3–2.5]; n = 6,175), and Hodgkin lymphoma (RR: 2.4 [2.3–2.5]; n = 2,027). Survivors spent on average five times as many days in hospital as comparison individuals (SBDR: 4.96 [4.94–4.98]; n = 422,218). The analyses of bed days in hospital included new primary cancers and recurrences. Of the total 422,218 days survivors spent in hospital, 47% (197,596 bed days) were for new primary cancers and recurrences. Our study is likely to underestimate the absolute overall disease burden experienced by survivors, as less severe late effects are missed if they are treated sufficiently in the outpatient setting or in the primary health care system.
Conclusions
Childhood cancer survivors were at increased long-term risk for diseases requiring inpatient treatment even decades after their initial cancer. Health care providers who do not work in the area of late effects, especially those in primary health care, should be aware of this highly challenged group of patients in order to avoid or postpone hospitalisations by prevention, early detection, and appropriate treatments.
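The headline figures in the abstract above follow from simple arithmetic on the reported totals: the rate ratio is observed over expected hospitalisations, and the absolute excess risk is the excess count per 100,000 person-years. The sketch below only reproduces the point estimates from the published totals; it does not perform the indirect standardisation or compute confidence intervals, and the function names are hypothetical.

```python
def rate_ratio(observed, expected):
    """Standardised rate ratio: observed events divided by expected events."""
    return observed / expected


def absolute_excess_risk(observed, expected, person_years, per=100_000):
    """Excess events per `per` person-years of follow-up."""
    return (observed - expected) / person_years * per


# Totals taken from the abstract.
OBSERVED = 21_325        # first hospitalisations among survivors
EXPECTED = 10_999        # hospitalisations expected from background rates
PERSON_YEARS = 336_554   # follow-up time

rr = rate_ratio(OBSERVED, EXPECTED)
aer = absolute_excess_risk(OBSERVED, EXPECTED, PERSON_YEARS)
cancer_bed_day_share = 197_596 / 422_218

print(f"RR  = {rr:.2f}")
print(f"AER = {aer:.0f} per 100,000 person-years")
print(f"Bed days for new primary cancers and recurrences: {cancer_bed_day_share:.0%}")
```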
National Cancer Patient Experience Survey, 2013-2014
datacatalogue.cessda.eu
Updated Nov 28, 2024
Department of Health (2024). National Cancer Patient Experience Survey, 2013-2014 [Dataset]. http://doi.org/10.5255/UKDA-SN-7562-1
Unique identifier
https://doi.org/10.5255/UKDA-SN-7562-1
Dataset updated
Nov 28, 2024
Dataset authored and provided by
Department of Health
Time period covered
Jan 1, 2014 - Jun 1, 2014
Area covered
England
Variables measured
Individuals, National
Measurement technique
Postal survey. Three communications were despatched to patients: an initial survey and two reminders (to non-responders only).
Description
Abstract copyright UK Data Service and data collection copyright owner.
The National Cancer Patient Experience Surveys (NCPES) began in 2010, after the 2007 'Cancer Reform Strategy' set out a commitment to establish a new survey programme. The NCPES is intended to be a vehicle enabling and supporting quality improvement in the NHS and has been used by national bodies, NHS Hospitals, specialist cancer teams, and national and condition specific charities to improve services for patients. It is designed to monitor national progress on cancer care and to help gather vital information on the Transforming Inpatient Care Programme, the National Cancer Survivorship Initiative and the National Cancer Equality Initiative. An Advisory Group was set up for the NCPES with the National Cancer Director, professionals, voluntary sector representatives, academics and patient survey experts. The Group agreed on the following guiding principles and objectives:
a standard national survey tool was to be used
surveys would be conducted at Trust level and identify cancer groups
the survey would cover all cancers and include the whole care pathway
the survey should use the word 'cancer' unlike the 2000 and 2004 surveys
the survey focus would be on patients (rather than carers)
the data would be used for benchmarking performance across Trusts and by cancer groups where numbers allow
the data would be used to inform national and local policy
the data would be made publicly available whilst observing patient data protection requirements and maintaining confidentiality.

The NCPES has been replicated in Wales (see SN 7510), Northern Ireland, the Isle of Man, parts of Australia, and the Middle East. Further information can be found on the Quality Health Limited National Cancer Patient Experience Survey webpage and the NHS England Cancer Patient Experience Survey webpage.

2010-2015 surveys temporarily withdrawn
The data for the 2010-2014 surveys were temporarily withdrawn at the request of the depositor in October 2015. The 2015 data (SN 8163 and the Special Licence version, SN 8164) were temporarily withdrawn at the request of the depositor in February 2020.

The 2013-2014 survey included all adult patients who were treated for cancer between 1 September and 30 November 2013 in NHS Trusts across England. Patients with all cancers were included, defined by their ICD10 code (cancer diagnosis code). The survey covered both inpatients and day case patients.

Main Topics:
The data cover different stages of the patients' 'cancer journey', from diagnosis to outpatient treatment:
initial GP visits before diagnosis (how many appointments, time period)
diagnostic tests (understanding of these)
how patients were told about the cancer diagnosis (understanding, sensitivity, written information)
decisions on treatment (understanding, side effects explained, involvement in decision making, written information)
whether patients were given a named key worker (Cancer Nurse Specialist provision and experience of them)
support measures patients were informed about (information on support groups, financial help, free prescriptions)
hospital doctors (understanding, confidence and trust in them, knowledge of patient case)
ward nurses (understanding, confidence, availability)
overall hospital care and treatment (information provision, privacy, knowledge of case, pain control, dignity and respect)
information provided before going home (written information and understanding, information on care at home and health or social services provision)
day patient experience (radiotherapy, chemotherapy, side effects, pain control, emotional support, appointment delay, time with doctor, doctor notes and case understanding)
wider care experience (hospital and community staff working together, information transfer)
demographic data
information provided by the participating Trusts such as date of discharge, diagnosis etc.
Standard Measures: Positive scoring methodology was used to create individual question scores. The National Report used analysis of IMD deciles based on patients' postcodes provided as part of the dataset by individual NHS...
The PANORAMA Challenge: Public Training and Development Dataset (1)
zenodo.org
bin, zip
Updated Apr 22, 2024
Natália Alves; Megan Schuurmans; Derya Yakar; Pierpaolo Vendittelli; Geert Litjens; John Hermans; Henkjan Huisman (2024). The PANORAMA Challenge: Public Training and Development Dataset (1) [Dataset]. http://doi.org/10.5281/zenodo.10998332
Available download formats: zip, bin
Unique identifier
https://doi.org/10.5281/zenodo.10998332
Dataset updated
Apr 22, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Natália Alves; Megan Schuurmans; Derya Yakar; Pierpaolo Vendittelli; Geert Litjens; John Hermans; Henkjan Huisman
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Time period covered
Apr 22, 2024
Description
This dataset represents the PANORAMA: Public Training and Development Dataset. It contains 2238 anonymized contrast-enhanced CT (CECT) scans from 2238 patients acquired at two centers (Radboud University Medical Center, University Medical Center Groningen) based in The Netherlands. Additionally, it contains 194 cases from the Medical Segmentation Decathlon dataset and 80 cases from National Institutes of Health. For all updates/fixes regarding this dataset, please join the challenge and check out our dedicated forum post on this topic. The corresponding labels of the PANORAMA dataset can be found here.

The PANORAMA challenge is an all-new grand challenge that aims to validate the diagnostic performance of artificial intelligence and radiologists at pancreatic ductal adenocarcinoma (PDAC) detection/diagnosis in CECT, with histopathology and follow-up (≥ 3 years) as the reference standard, in a retrospective setting in the hidden testing dataset. The study hypothesizes that state-of-the-art AI algorithms are non-inferior to radiologists reading CECT.

Key aspects of the PANORAMA study design have been established in conjunction with an international scientific advisory board of 13 experts in AI and pancreas radiology, as well as a patient representative, to unify and standardize present-day guidelines and to ensure meaningful validation of pancreas AI towards clinical translation (Reinke et al., 2021).

This PANORAMA dataset contains: batch 1 out of 4
Data from: Supplementary Material for: Hospital Volume and Mortality...
datasetcatalog.nlm.nih.gov
karger.figshare.com
Updated Nov 8, 2018
Jo, T.; Michihata, N.; Matsui, H.; Hiraishi, Y.; Nagase, T.; Sakamoto, Y.; Yamauchi, Y.; Urushiyama, H.; Hasegawa, W.; Fushimi, K.; Yasunaga, H. (2018). Supplementary Material for: Hospital Volume and Mortality following Diagnostic Bronchoscopy in Lung Cancer Patients: Data from a National Inpatient Database in Japan [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000647088
Dataset updated
Nov 8, 2018
Authors
Jo, T.; Michihata, N.; Matsui, H.; Hiraishi, Y.; Nagase, T.; Sakamoto, Y.; Yamauchi, Y.; Urushiyama, H.; Hasegawa, W.; Fushimi, K.; Yasunaga, H.
Description
Background: Recent advances in bronchoscopy utilizing endobronchial ultrasound (EBUS) as well as lung cancer therapy may have driven physicians to perform diagnostic bronchoscopy (DB) for high-risk patients. Objectives: The aim of this study was to clarify the relationship between hospital volume (HV) and outcomes of DB. Methods: We collected data on inpatients with lung cancer who underwent DB from July 2010 to March 31, 2014. The annual HV of DB was classified as “very low” (≤50 cases/year), “low” (51–100 cases/year), “high” (101–300 cases/year), or “very high” (> 300 cases/year). The primary outcome was all-cause 7-day mortality after DB. Multivariable logistic regression fitted with a generalized estimation equation was performed to evaluate the association between HV and all-cause 7-day mortality after DB, adjusted for patient background factors. Results: We identified a total of 77,755 eligible patients in 954 hospitals. All-cause 7-day mortality was 0.5%. Compared with the low-volume group, 7-day mortality was significantly lower in the high-volume group (odds ratio [OR] = 0.69, 95% confidence interval [CI]: 0.52–0.92, p = 0.010), and a similar trend was shown in the very-high-volume group (OR = 0.67; 95% CI: 0.43–1.05, p = 0.080). Radial EBUS with the guide sheath method and EBUS-guided transbronchial needle aspiration showed a significantly lower 7-day mortality. Conclusions: All-cause 7-day mortality was inversely associated with HV. The risk of DB in patients with lung cancer should be recognized, and the exploitation of EBUS may help reduce mortality after DB.
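
The four hospital-volume categories defined in the Methods can be written as a small lookup. This is a minimal Python sketch using the cut-offs quoted above; the function name is hypothetical and is not part of the supplementary material.

```python
def classify_hospital_volume(cases_per_year):
    """Map an annual diagnostic bronchoscopy count to the study's volume category."""
    if cases_per_year < 0:
        raise ValueError("cases_per_year must be non-negative")
    if cases_per_year <= 50:
        return "very low"    # <=50 cases/year
    if cases_per_year <= 100:
        return "low"         # 51-100 cases/year
    if cases_per_year <= 300:
        return "high"        # 101-300 cases/year
    return "very high"       # >300 cases/year


# Hypothetical annual volumes, chosen to sit on the category boundaries.
for volume in (50, 51, 100, 101, 300, 301):
    print(volume, classify_hospital_volume(volume))
```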
Data from: Health assistance path of women between diagnosis and treatment...
datasetcatalog.nlm.nih.gov
scielo.figshare.com
Updated Mar 23, 2021
de Carvalho, Priscila Guedes; Rodrigues, Nádia Cristina Pinheiro; O´Dwer, Gisele (2021). Health assistance path of women between diagnosis and treatment initiation for cervix cancer [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000826310
Dataset updated
Mar 23, 2021
Authors
de Carvalho, Priscila Guedes; Rodrigues, Nádia Cristina Pinheiro; O´Dwer, Gisele
Description
ABSTRACT This study aims to analyze the health assistance pathway of women living in Rio de Janeiro city diagnosed with cervical cancer who were referred for treatment in a referral oncology unit. In the first stage of the study, we evaluated the time elapsed between cancer diagnosis and treatment initiation for women enrolled in 2014, taking as reference the 60-day limit for treatment initiation in the Unified Health System (SUS) established by Brazilian Federal Law 12,732/2012. In the second stage, we analyzed the narratives of five women regarding their paths through health services from diagnosis up to the first therapeutic intervention, taking into account the aspects of comprehensive health care. It was observed that 88% of the treatments started after the 60-day legal period and that 65.5% of the women received a diagnosis at an advanced stage of the disease. The mean time to treatment initiation was 115.4 days. The main problems identified in the pathway analysis concern the availability of services and the integration of actions across the different levels of health care, as well as the lack of information on the disease and the purpose of Pap smears.
COVID-19 case rate per 100,000 population and percent test positivity in the...
data.ct.gov
catalog.data.gov
csv, xlsx, xml
Updated Jun 23, 2022
Department of Public Health (2022). COVID-19 case rate per 100,000 population and percent test positivity in the last 14 days by town - ARCHIVE [Dataset]. https://data.ct.gov/widgets/hree-nys2
Available download formats: csv, xml, xlsx
Dataset updated
Jun 23, 2022
Dataset authored and provided by
Department of Public Health
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
Note: DPH is updating and streamlining the COVID-19 cases, deaths, and testing data. As of 6/27/2022, the data will be published in four tables instead of twelve.

The COVID-19 Cases, Deaths, and Tests by Day dataset contains cases and test data by date of sample submission. The death data are by date of death. This dataset is updated daily and contains information back to the beginning of the pandemic. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Cases-Deaths-and-Tests-by-Day/g9vi-2ahj.

The COVID-19 State Metrics dataset contains over 93 columns of data. This dataset is updated daily and currently contains information starting June 21, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-State-Level-Data/qmgw-5kp6 .

The COVID-19 County Metrics dataset contains 25 columns of data. This dataset is updated daily and currently contains information starting June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-County-Level-Data/ujiq-dy22 .

The COVID-19 Town Metrics dataset contains 16 columns of data. This dataset is updated daily and currently contains information starting June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Town-Level-Data/icxw-cada . To protect confidentiality, if a town has fewer than 5 cases or positive NAAT tests over the past 7 days, those data will be suppressed.

This dataset includes a count and rate per 100,000 population for COVID-19 cases, a count of COVID-19 molecular diagnostic tests, and a percent positivity rate for tests among people living in community settings for the previous two-week period. Dates are based on date of specimen collection (cases and positivity).

A person is considered a new case only upon their first COVID-19 testing result because a case is defined as an instance or bout of illness. If they are tested again subsequently and are still positive, it still counts toward the test positivity metric but they are not considered another case.

Percent positivity is calculated as the number of positive tests among community residents conducted during the 14 days divided by the total number of positive and negative tests among community residents during the same period. If someone was tested more than once during that 14 day period, then those multiple test results (regardless of whether they were positive or negative) are included in the calculation.
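
The two town-level metrics described above reduce to simple ratios. This is a minimal Python sketch under the definitions given in the text (cases per 100,000 residents, and positive tests divided by all positive and negative tests in the same 14-day window); the function names and the sample town are hypothetical.

```python
def case_rate_per_100k(new_cases, population):
    """Cases in the reporting window per 100,000 residents."""
    return new_cases * 100_000 / population


def percent_positivity(positive_tests, negative_tests):
    """Share of all tests in the window that were positive, in percent.

    Repeat tests for the same person count individually, as the text describes.
    Returns None when no tests were reported.
    """
    total = positive_tests + negative_tests
    if total == 0:
        return None
    return positive_tests * 100 / total


# Hypothetical town: 21,000 residents, 42 new cases,
# 60 positive and 1,140 negative tests over the 14 days.
rate = case_rate_per_100k(42, 21_000)
positivity = percent_positivity(60, 1_140)
print(f"{rate:.1f} cases per 100,000; {positivity:.1f}% positive")
```

Because a person counts as a case only once but each of their tests counts toward positivity, the two numerators are not expected to match.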

These case and test counts do not include cases or tests among people residing in congregate settings, such as nursing homes, assisted living facilities, or correctional facilities.

These data are updated weekly and reflect the previous two full Sunday-Saturday (MMWR) weeks (https://wwwn.cdc.gov/nndss/document/MMWR_week_overview.pdf).

DPH note about change from 7-day to 14-day metrics: Prior to 10/15/2020, these metrics were calculated using a 7-day average rather than a 14-day average. The 7-day metrics are no longer being updated as of 10/15/2020 but the archived dataset can be accessed here: https://data.ct.gov/Health-and-Human-Services/COVID-19-case-rate-per-100-000-population-and-perc/s22x-83rd

As you know, we are learning more about COVID-19 all the time, including the best ways to measure COVID-19 activity in our communities. CT DPH has decided to shift to 14-day rates because these are more stable, particularly at the town level, as compared to 7-day rates. In addition, since the school indicators were initially published by DPH last summer, CDC has recommended 14-day rates and other states (e.g., Massachusetts) have started to implement 14-day metrics for monitoring COVID transmission as well.

With respect to geography, we also have learned that many people are looking at the town-level data to inform decision making, despite emphasis on the county-level metrics in the published addenda. This is understandable as there has been variation within counties in COVID-19 activity (for example, rates that are higher in one town than in most other towns in the county).

Additional notes: As of 11/5/2020, CT DPH has added antigen testing for SARS-CoV-2 to reported test counts in this dataset. The tests included in this dataset comprise both molecular and antigen tests. Molecular tests reported include polymerase chain reaction (PCR) and nucleic acid amplification (NAAT) tests.

The population data used to calculate rates is based on the CT DPH population statistics for 2019, which is available online here: https://portal.ct.gov/DPH/Health-Information-Systems--Reporting/Population/Population-Statistics. Prior to 5/10/2021, the population estimates from 2018 were used.

Data suppression is applied when the rate is <5 cases per 100,000 or if there are <5 cases within the town. Information on why data suppression rules are applied can be found online here: https://www.cdc.gov/cancer/uscs/technical_notes/stat_methods/suppression.htm
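
The suppression rule can be expressed as a small predicate. This is a hedged sketch that assumes only the two thresholds quoted above; the function name, parameter names, and sample values are hypothetical.

```python
def is_suppressed(case_count, rate_per_100k, min_cases=5, min_rate=5.0):
    """Return True when a town's figures would be suppressed under the stated rule."""
    return case_count < min_cases or rate_per_100k < min_rate


# Hypothetical towns: (case count, rate per 100,000).
print(is_suppressed(3, 12.0))   # fewer than 5 cases
print(is_suppressed(8, 40.0))   # both thresholds met
```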
COVID-19 Cases and Deaths by Race/Ethnicity - ARCHIVE
splitgraph.com
data.ct.gov
Updated Aug 2, 2023
Department of Public Health (2023). COVID-19 Cases and Deaths by Race/Ethnicity - ARCHIVE [Dataset]. https://www.splitgraph.com/ct-gov/covid19-cases-and-deaths-by-raceethnicity-archive-7rne-efic/
Available download formats: application/openapi+json, json, application/vnd.splitgraph.image
Dataset updated
Aug 2, 2023
Dataset authored and provided by
Department of Public Health
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Description
COVID-19 cases and associated deaths that have been reported among Connecticut residents, broken down by race and ethnicity. All data in this report are preliminary; data for previous dates will be updated as new reports are received and data errors are corrected. Deaths reported to either the Office of the Chief Medical Examiner (OCME) or the Department of Public Health (DPH) are included in the COVID-19 update.

The following data show the number of COVID-19 cases and associated deaths per 100,000 population by race and ethnicity. Crude rates represent the total cases or deaths per 100,000 people. Age-adjusted rates consider the age of the person at diagnosis or death when estimating the rate and use a standardized population to provide a fair comparison between population groups with different age distributions. Age adjustment is important in Connecticut because the median age among the non-Hispanic white population is 47 years, whereas it is 34 years among non-Hispanic blacks and 29 years among Hispanics. Because most non-Hispanic white residents who died were over 75 years of age, the age-adjusted rates are lower than the unadjusted rates. In contrast, Hispanic residents who died tend to be younger than 75 years of age, which results in higher age-adjusted rates.

The population data used to calculate rates is based on the CT DPH population statistics for 2019, which is available online here: https://portal.ct.gov/DPH/Health-Information-Systems--Reporting/Population/Population-Statistics. Prior to 5/10/2021, the population estimates from 2018 were used.

Rates are standardized to the 2000 US Millions Standard population (data available here: https://seer.cancer.gov/stdpopulations/). Standardization was done using 19 age groups (0, 1-4, 5-9, 10-14, ..., 80-84, 85 years and older). More information about direct standardization for age adjustment is available here: https://www.cdc.gov/nchs/data/statnt/statnt06rv.pdf
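
Direct standardization weights each age group's rate by that group's share of a standard population. The Python sketch below illustrates the calculation with three made-up age groups; the counts and the standard-population weights are hypothetical placeholders, not the 19-group 2000 US standard referenced above.

```python
def crude_rate(events, population, per=100_000):
    """Total events divided by total population, per `per` people."""
    return sum(events) / sum(population) * per


def age_adjusted_rate(events, population, standard_population, per=100_000):
    """Directly standardized rate: age-specific rates weighted by the standard population."""
    if not (len(events) == len(population) == len(standard_population)):
        raise ValueError("all inputs need one entry per age group")
    total_standard = sum(standard_population)
    adjusted = 0.0
    for deaths, people, standard in zip(events, population, standard_population):
        adjusted += (deaths / people) * (standard / total_standard)
    return adjusted * per


# Hypothetical group, youngest to oldest age band.
deaths = [2, 10, 60]
people = [40_000, 40_000, 20_000]
standard = [500, 300, 200]   # illustrative weights only

print(f"crude:        {crude_rate(deaths, people):.1f} per 100,000")
print(f"age-adjusted: {age_adjusted_rate(deaths, people, standard):.1f} per 100,000")
```

In this toy example the adjusted rate differs from the crude rate only because the group's age mix differs from the standard's, which is the effect the description attributes to Connecticut's differing median ages.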

Categories are mutually exclusive. The category “multiracial” includes people who answered ‘yes’ to more than one race category. Counts may not add up to total case counts as data on race and ethnicity may be missing. Age-adjusted rates are calculated only for groups with more than 20 deaths. Abbreviation: NH=Non-Hispanic.

Data on Connecticut deaths were obtained from the Connecticut Deaths Registry maintained by the DPH Office of Vital Records. Cause of death was determined by a death certifier (e.g., physician, APRN, medical

Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:

See the Splitgraph documentation for more information.
Health risk assessment of inhaled oil spill emissions with and without...
search.dataone.org
data.griidc.org
Updated Feb 5, 2025
Koehler, Kirsten (2025). Health risk assessment of inhaled oil spill emissions with and without adding dispersant (due to volatile organic compounds) [Dataset]. http://doi.org/10.7266/N7RX99PN
Unique identifier
https://doi.org/10.7266/N7RX99PN
Dataset updated
Feb 5, 2025
Dataset provided by
GRIIDC
Authors
Koehler, Kirsten
Description
We performed laboratory measurements to record the concentrations of different volatile organic compounds (VOCs) emitted from a crude oil slick before and after premixing with dispersant. We input these concentrations into a health risk assessment model to estimate the cancer risk and hazard quotients based on USEPA-designated measures and reference concentrations. The assessment targeted cleanup workers or nearby residents. Based on the results, the cancer risk from exposure to toluene and benzene for one hour per day over 3 months decreased from 74 and 57 excess lifetime cancer cases per million to 66 and 37 (11% lower) excess lifetime cancer cases per million. Dispersant addition was effective in reducing emissions of the lighter VOCs (up to 30% lower emission rate). However, the hazard quotients of the non-carcinogenic VOCs, even after dispersant addition, were 2 to 3 orders of magnitude greater than 1, meaning that there are serious concerns about exposure to these VOCs.
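
The relative changes behind these figures can be checked with a few lines of Python. This is a sketch under two assumptions: the risk pairs follow the order in which the compounds are named in the description, and a hazard quotient is the exposure concentration divided by the reference concentration. The concentration values in the last example are hypothetical.

```python
def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


def hazard_quotient(exposure_concentration, reference_concentration):
    """Ratio of exposure to reference concentration; values above 1 flag concern."""
    return exposure_concentration / reference_concentration


# Excess lifetime cancer cases per million, before and after dispersant.
risks = {"toluene": (74, 66), "benzene": (57, 37)}
for compound, (before, after) in risks.items():
    print(f"{compound}: {percent_reduction(before, after):.1f}% lower")

# Hypothetical concentrations in matching units.
print(hazard_quotient(500.0, 5.0))
```

The quoted "11% lower" matches the first pair (74 to 66); the second pair (57 to 37) corresponds to a reduction of about 35%.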
The global Her2 Antibodies Market size will be USD 9351.4 million in 2025.
cognitivemarketresearch.com
pdf,excel,csv,ppt
Updated Jun 20, 2025
Cognitive Market Research (2025). The global Her2 Antibodies Market size will be USD 9351.4 million in 2025. [Dataset]. https://www.cognitivemarketresearch.com/her2-antibodies-market-report
Available download formats: pdf, excel, csv, ppt
Dataset updated
Jun 20, 2025
Dataset authored and provided by
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
According to Cognitive Market Research, the global Her2 Antibodies Market size will be USD 9351.4 million in 2025. It will expand at a compound annual growth rate (CAGR) of 5.50% from 2025 to 2033.

North America held the largest share, more than 37% of global revenue, with a market size of USD 3460.02 million in 2025, and will grow at a compound annual growth rate (CAGR) of 14.0% from 2025 to 2033.
Europe accounted for over 29% of global revenue, with a market size of USD 2711.91 million.
APAC held around 24% of global revenue, with a market size of USD 2244.34 million in 2025, and will grow at a CAGR of 7.5% from 2025 to 2033.
South America has more than 3.8% of global revenue, with a market size of USD 355.35 million in 2025, and will grow at a CAGR of 4.5% from 2025 to 2033.
The Middle East had around 4% of global revenue, with an estimated market size of USD 374.06 million in 2025, and will grow at a CAGR of 4.8% from 2025 to 2033.
Africa had around 2.2% of global revenue, with an estimated market size of USD 205.73 million in 2025, and will grow at a CAGR of 5.2% from 2025 to 2033.
Pertuzumab is the fastest-growing segment of the Her2 Antibodies Market.
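
A compound annual growth rate translates into a projected size by repeated multiplication, and the regional shares follow from dividing each regional figure by the global total. This is a minimal Python sketch using the 2025 figures quoted in this description; the function name is illustrative, and the projection is an arithmetic consequence of the stated CAGR rather than a figure taken from the report.

```python
def project_with_cagr(start_value, cagr, years):
    """Value after compounding `cagr` (e.g. 0.055 for 5.5%) for `years` years."""
    return start_value * (1 + cagr) ** years


global_2025 = 9351.4  # USD million, as stated
regions_2025 = {      # USD million, as stated
    "North America": 3460.02,
    "Europe": 2711.91,
    "APAC": 2244.34,
    "South America": 355.35,
    "Middle East": 374.06,
    "Africa": 205.73,
}

# Share of global revenue implied by each regional figure.
for region, size in regions_2025.items():
    print(f"{region}: {size / global_2025:.1%}")

# Global size implied for 2033 by a 5.50% CAGR over 8 years.
global_2033 = project_with_cagr(global_2025, 0.055, 2033 - 2025)
print(f"Projected 2033 global size: USD {global_2033:.1f} million")
```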

Market Dynamics of Her2 Antibodies Market

Key Drivers for Her2 Antibodies Market

Government Initiatives To Improve Breast Cancer Care And Treatment Boost HER2 Antibodies Market

Government initiatives to improve breast cancer care and treatment are expected to drive future growth in the HER2 antibody market. These initiatives include policies, funding, and programs to improve the accessibility, affordability, and quality of care for people with breast cancer. These initiatives advance patient care and outcomes by increasing access to HER2 antibodies, a critical treatment for HER2-positive breast cancer, via subsidized programs and research funding. For instance, in February 2023, the World Health Organization (WHO) released a new Global Breast Cancer Initiative Framework, which outlines a strategy for saving 2.5 million lives from breast cancer by 2040. As a result, government efforts to improve breast cancer care are driving the HER2 antibody market.

https://www.who.int/news/item/03-02-2023-who-launches-new-roadmap-on-breast-cancer

Increasing Incidence Of Breast Cancer Cases Fuels HER2 Antibodies Market

The rising global incidence of breast cancer cases is expected to drive growth in the HER2 antibodies market over the forecast period. For instance, in January 2022, the American Cancer Society predicted that there would be 1.9 million new cancer diagnoses and 609,360 cancer-related deaths in the United States, equating to approximately 1,670 deaths per day. Breast cancer is one of the four most common types of cancer worldwide, accounting for a sizable proportion of new cancer cases. As a result, the rise in global breast cancer incidence rates is expected to drive up demand for HER2 antibodies in the coming years.

https://www.cancer.org/research/cancer-facts-statistics/all-cancer-facts-figures/cancer-facts-figures-2022.html#:~:text=Estimated%20numbers%20of%20new%20cancer,factors%2C%20early%20detection%2C%20and%20treatment

Restraint Factor for the Her2 Antibodies Market

High Cost Of HER2 Antibody Therapies Limits Market Growth

The high cost of HER2 antibody treatments significantly impedes market expansion by limiting patient access to these life-changing therapies. Most patients with HER2-positive breast cancer may be discouraged by the cost of these therapies, forcing them to postpone or discontinue treatment. This is exacerbated in areas with limited healthcare coverage or inadequate insurance coverage for such advanced therapies. As a result, the premium price can create disparities in treatment access, affecting overall patient outcomes and limiting the market's growth potential. To increase patient access and market growth, HER2 antibody therapies must be made more affordable.

Introduction of the Her2 Antibodies Market

HER2 antibodies are used in the treatment of HER2-positive breast cancer. HER2 is part of the human epidermal growth factor family. Overexpression of the HER2 oncogene causes the development and progression of some types...
Lifestyle choices of individuals with cancer in the United Kingdom 2014-2018...
statista.com
Updated Nov 26, 2025
Statista (2025). Lifestyle choices of individuals with cancer in the United Kingdom 2014-2018 [Dataset]. https://www.statista.com/statistics/418807/lifestyle-of-individuals-with-cancer-in-the-united-kingdom/
Dataset updated
Nov 26, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
United Kingdom
Description
This statistic depicts the lifestyle trends of adults diagnosed with cancer in the United Kingdom in 2014 and 2018. In 2018, 53 percent of adults with cancer did vigorous exercise of 20 minutes or more on at least one day per month. In the same year, 82 percent of adults who had been diagnosed with cancer drank alcohol.
CBS News/New York Times Women's Health Poll, February 1997
icpsr.umich.edu
ascii, sas, spss +1
Updated Jan 31, 2007
Inter-university Consortium for Political and Social Research [distributor] (2007). CBS News/New York Times Women's Health Poll, February 1997 [Dataset]. http://doi.org/10.3886/ICPSR04487.v1
Available download formats: ascii, sas, spss, stata
Unique identifier
https://doi.org/10.3886/ICPSR04487.v1
Dataset updated
Jan 31, 2007
Dataset provided by
Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
License
https://www.icpsr.umich.edu/web/ICPSR/studies/4487/terms
Time period covered
Feb 1997
Area covered
United States
Description
This special topic poll, fielded February 18-19, 1997, is part of a continuing series of monthly surveys that solicit public opinion on the presidency and on a range of other political and social issues. The focus of this data collection was on women's health issues. Views were sought on whether government health agencies paid enough attention to women's health issues, and how well the federal government regulated the environmental practices of businesses and the safety of medical equipment and procedures. Respondents were asked to name the leading cause of death for women and whether they had ever heard of mammograms. Female respondents were polled on whether a doctor had ever discussed mammograms with them, whether they had ever had one, how accurate, safe, and painful they were, at which age women should begin getting mammograms, and whether the federal government should set guidelines for mammograms. Female respondents were also polled on the benefits of early detection of breast cancer and how often they conducted breast self-examinations. All respondents were polled on whether they had noticed the new television program ratings system, whether they had used the ratings to prohibit their children from watching certain television programs, and how many hours per day their children watched television. Additional topics addressed health insurance coverage, whether the respondent or a female relative was ever diagnosed with breast cancer, and whether respondents would like to take an "adventure" vacation. Demographic variables included sex, age, race, education level, household income, political party affiliation, political philosophy, type of residential area (e.g., urban or rural), and religious preference.

Cite

Mikey-TraceGod (2025). Lung-Cancer-Risk-Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/12844025

Lung-Cancer-Risk-Dataset

A Clean, Preprocessed Dataset with 50,000 Patient Profiles for Lung Cancer Risk

4 scholarly articles cite this dataset (View in Google Scholar)

Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Unique identifier

https://doi.org/10.34740/kaggle/dsv/12844025

Dataset updated

Aug 23, 2025

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Mikey-TraceGod

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Lung Cancer Risk Dataset

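Overview

This dataset contains 50,000 patient profiles designed for lung cancer risk analysis and machine learning applications. The dataset is clean, preprocessed, and ready for immediate use in classification tasks, statistical analysis, and data visualization.

Rows: 50,000
Columns: 11
File: preprocessed_lung_cancer_dataset.csv
License: CC0: Public Domain

Dataset Description

The dataset includes patient profiles with features based on established lung cancer risk factors such as smoking history, environmental exposures, and chronic lung conditions. All data is synthetic and designed to reflect realistic risk factor distributions while maintaining patient privacy.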

Features

Column	Type	Description	Values/Range
patient_id	Integer	Unique patient identifier	100000-149999
age	Integer	Patient age in years	18-100
gender	String	Patient gender	'Male', 'Female'
pack_years	Float	Smoking exposure (years × packs per day)	0-100
radon_exposure	String	Residential radon exposure level	'Low', 'Medium', 'High'
asbestos_exposure	String	Occupational asbestos exposure history	'Yes', 'No'
secondhand_smoke_exposure	String	Passive smoking exposure	'Yes', 'No'
copd_diagnosis	String	Chronic obstructive pulmonary disease diagnosis	'Yes', 'No'
alcohol_consumption	String	Alcohol consumption pattern	'None', 'Moderate', 'Heavy'
family_history	String	Family history of lung cancer	'Yes', 'No'
lung_cancer	String	Target variable: Lung cancer diagnosis	'Yes', 'No'
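
The feature table can be turned into a simple schema check before any modeling. The following is a minimal sketch using only the Python standard library; the allowed values and ranges are transcribed from the table, while the function names `validate_row` and `validate_csv` are hypothetical helpers introduced here for illustration and are not part of the dataset.

```python
import csv

# Allowed values and ranges transcribed from the Features table.
CATEGORICAL = {
    "gender": {"Male", "Female"},
    "radon_exposure": {"Low", "Medium", "High"},
    "asbestos_exposure": {"Yes", "No"},
    "secondhand_smoke_exposure": {"Yes", "No"},
    "copd_diagnosis": {"Yes", "No"},
    "alcohol_consumption": {"None", "Moderate", "Heavy"},
    "family_history": {"Yes", "No"},
    "lung_cancer": {"Yes", "No"},
}
NUMERIC = {
    "patient_id": (100000, 149999),
    "age": (18, 100),
    "pack_years": (0.0, 100.0),
}


def validate_row(row):
    """Return a list of problems found in one CSV row (empty if none)."""
    problems = []
    for col, allowed in CATEGORICAL.items():
        if row.get(col) not in allowed:
            problems.append(f"{col}: unexpected value {row.get(col)!r}")
    for col, (lo, hi) in NUMERIC.items():
        try:
            value = float(row[col])
        except (KeyError, TypeError, ValueError):
            problems.append(f"{col}: missing or not numeric")
            continue
        if not lo <= value <= hi:
            problems.append(f"{col}: {value} outside {lo}-{hi}")
    return problems


def validate_csv(lines):
    """Map 1-based data-row numbers to their problems, skipping clean rows.

    `lines` is any iterable of CSV text lines, such as an open file.
    """
    report = {}
    for number, row in enumerate(csv.DictReader(lines), start=1):
        problems = validate_row(row)
        if problems:
            report[number] = problems
    return report
```

To run it against the file named in the listing, pass an open handle: `with open("preprocessed_lung_cancer_dataset.csv", newline="") as f: report = validate_csv(f)`. One caveat when switching to other readers: the `alcohol_consumption` column uses the literal string `None`, which some CSV parsers may interpret as a missing value by default (recent pandas versions appear to do so unless `keep_default_na=False` is passed), so it is worth checking that category after loading.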

Data Quality

Complete: No missing values or duplicates
Clean: All values within realistic ranges
Balanced Features: Realistic distribution of risk factors
Target Distribution: Approximately 25% positive cases, reflecting real-world lung cancer prevalence
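
The claims above (no missing values, no duplicates, roughly 25% positive cases) are straightforward to verify after download. The sketch below is one way to do so with the standard library; `quality_summary` is a hypothetical helper name, and the column names `patient_id` and `lung_cancer` are taken from the feature table.

```python
from collections import Counter


def quality_summary(rows, id_col="patient_id", target_col="lung_cancer"):
    """Summarize missing cells, duplicates, and the share of 'Yes' targets.

    `rows` is an iterable of dicts, e.g. the output of csv.DictReader.
    """
    rows = list(rows)
    # A cell counts as missing if it is None or blank after stripping.
    missing = sum(
        1
        for row in rows
        for value in row.values()
        if value is None or str(value).strip() == ""
    )
    # Extra occurrences of an identifier beyond its first appearance.
    id_counts = Counter(row.get(id_col) for row in rows)
    duplicate_ids = sum(n - 1 for n in id_counts.values() if n > 1)
    # Rows that repeat an earlier row in every column.
    unique_rows = {tuple(sorted(row.items())) for row in rows}
    duplicate_rows = len(rows) - len(unique_rows)
    positives = sum(1 for row in rows if row.get(target_col) == "Yes")
    return {
        "rows": len(rows),
        "missing_cells": missing,
        "duplicate_ids": duplicate_ids,
        "duplicate_rows": duplicate_rows,
        "positive_rate": positives / len(rows) if rows else 0.0,
    }
```

If the listing is accurate, the summary for the full file should report 50,000 rows, zero missing cells, zero duplicates, and a positive rate near 0.25. Because the data is synthetic, that rate describes how the dataset was generated and should not be read as a population estimate.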

Use Cases

Binary classification modeling
Risk factor correlation analysis
Data visualization and exploratory analysis
Machine learning pipeline development
Statistical hypothesis testing
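
As a starting point for the risk factor analysis use case, the share of positive cases can be compared across the levels of a categorical feature. This is a minimal sketch with a hypothetical helper, `positive_rate_by`; it is a descriptive tabulation only, not a statistical test, and it assumes rows are dicts keyed by the column names in the feature table.

```python
from collections import defaultdict


def positive_rate_by(rows, factor, target_col="lung_cancer"):
    """Return {level: share of rows with target 'Yes'} for one feature."""
    counts = defaultdict(lambda: [0, 0])  # level -> [positives, total]
    for row in rows:
        tally = counts[row[factor]]
        if row[target_col] == "Yes":
            tally[0] += 1
        tally[1] += 1
    return {level: pos / total for level, (pos, total) in counts.items()}
```

For example, `positive_rate_by(rows, "copd_diagnosis")` gives the positive share among rows with and without a COPD diagnosis. Since the dataset is synthetic, any differences reflect the generating process and should be treated as practice material, not clinical evidence.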
